Temporal Exploratory Data Analysis¶
%load_ext autoreload
%autoreload 2
import os
import sys
import warnings
from calendar import day_name
from glob import glob
from itertools import product
from typing import Dict
import altair as alt
import duckdb
import pandas as pd
from scipy.stats import linregress
from watermark import watermark
_ = alt.data_transformers.disable_max_rows()
_ = alt.renderers.set_embed_options(actions=False)
PROJ_ROOT = os.pardir
src_dir = os.path.join(PROJ_ROOT, "src")
sys.path.append(src_dir)
%aimport file_utils
import file_utils as flut
%aimport open_data
import open_data as od
%aimport pandas_utils
import pandas_utils as pu
%aimport visualization_helpers
import visualization_helpers as vzu
About¶
Objective¶
The processed bike share ridership data is explored in order to understand
- high-level historical network performance
- attributes of the stations identified as top performers using historical data, which are to be prioritized in the advertising campaign that forms part of the business use-case for this project
- user behaviour by studying temporal patterns in historical bike share ridership data
Data¶
The following previously-created datasets are used in this exploratory data analysis
- daily weather data (raw)
- list of downtown neighbourhoods (raw)
- station info (includes geodata, raw)
- network expansion plans (raw)
- public holidays in Canada (raw)
- bike share ridership (processed)
- identification of bike share stations as top-performers or not (processed)
Notes¶
- This is the first part of a two-part EDA for this project. The next step contains the second part of the EDA, in which geospatial patterns in the Bike Share Toronto network will be studied.
- Using temporal insights, recommendations will be made in order to maximize exposure while also minimizing costs, as required by the client and discussed in the project scope. See the project's proposal from the project wiki for details about the scope.
Assumptions¶
- Same as in data retrieval and processing steps.
Outputs¶
- Charts showing EDA are produced.
- Recommendations file with logic to filter currently active stations based on temporal insights.
User Inputs¶
# ridership
years_proc_trips = {
2018: [f'Q{k}' for k in range(1, 4+1)],
2019: [f'Q{k}' for k in range(1, 4+1)],
2020: [f'{str(k).zfill(2)}' for k in range(1, 12+1)],
2021: [f'{str(k).zfill(2)}' for k in range(1, 12+1)],
2022: [f'{str(k).zfill(2)}' for k in range(1, 12+1)],
2023: [f'{str(k).zfill(2)}' for k in range(1, 3+1)],
}
# exploring turning (inflection) point
years_temp_dependence = [2018, 2019, 2020, 2021, 2022]
# plotting
axis_label_fontsize = 14
# exporting to disk
my_timezone = 'America/Toronto'
data_dir = os.path.join(PROJ_ROOT, 'data')
raw_data_dir = os.path.join(data_dir, 'raw', 'systems', 'toronto')
processed_data_dir = os.path.join(data_dir, 'processed')
reports_dir = os.path.join(PROJ_ROOT, 'reports')
figures_dir = os.path.join(reports_dir, 'figures')
# processed trips
fpaths_proc = {
y: [
f
for p in periods
for f in sorted(
glob(
os.path.join(
processed_data_dir,
f'processed__trips_{y}_{p}*.parquet.gzip',
)
)
)
]
for y, periods in years_proc_trips.items()
}
fpaths_proc_all = [f for _, v in fpaths_proc.items() for f in v]
fpaths_proc_2018_2022 = [f for y in range(2018, 2022+1) for f in fpaths_proc[y]]
# downtown neighbourhoods
fpath_downtown_neighs = glob(
os.path.join(raw_data_dir, 'downtown_neighbourhoods__*.parquet.gzip')
)[0]
# expansion plans
fpath_expansion = glob(
os.path.join(raw_data_dir, 'network_expansion__*.parquet.gzip')
)[0]
# station info for currently active stations
fpath_stations_info = glob(
os.path.join(raw_data_dir, 'stations_info__*.parquet.gzip')
)[0]
# daily weather data
fpath_weather = glob(
os.path.join(raw_data_dir, 'daily_weather__*.parquet.gzip')
)[0]
# holidays
fpath_holidays = glob(
os.path.join(raw_data_dir, 'holidays__*.parquet.gzip')
)[0]
# top performing stations
fpath_top_stations = glob(
os.path.join(processed_data_dir, 'stations_performance__*.parquet.gzip')
)[0]
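The globs above assume a fixed naming convention for the processed trip files. As a minimal sketch (no files are read; the directory name is illustrative), the pattern for a given year and period can be built like this:

```python
import os

# Sketch of the filename pattern assumed by the globs above:
# processed__trips_<year>_<period>*.parquet.gzip, where <period> is a
# quarter label (Q1-Q4) for 2018-2019 and a zero-padded month otherwise.
def trips_pattern(processed_data_dir: str, year: int, period: str) -> str:
    return os.path.join(
        processed_data_dir, f"processed__trips_{year}_{period}*.parquet.gzip"
    )

print(trips_pattern("processed", 2019, "Q2"))
print(trips_pattern("processed", 2022, "08"))
```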
def run_sql_query(query: str, verbose: bool=False) -> pd.DataFrame:
"""Run SQL query using DuckDB."""
with warnings.catch_warnings():
warnings.simplefilter("ignore", FutureWarning)
df_query = duckdb.sql(query).df()
if verbose:
print(f"Query returned {len(df_query):,} rows")
return df_query
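The helper suppresses the FutureWarnings that the DataFrame conversion can trigger. A stdlib-only sketch of the same suppression pattern, where the hypothetical noisy_query function stands in for duckdb.sql(query).df():

```python
import warnings

# Hypothetical function that emits a FutureWarning, standing in for
# the duckdb.sql(query).df() call wrapped by run_sql_query above.
def noisy_query() -> int:
    warnings.warn("conversion is deprecated", FutureWarning)
    return 42

with warnings.catch_warnings():
    # Ignore only FutureWarning, exactly as run_sql_query does;
    # other warning categories still propagate.
    warnings.simplefilter("ignore", FutureWarning)
    result = noisy_query()
print(result)
```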
Get Data¶
Daily Weather Data¶
Show the previously retrieved daily weather data
%%time
query = f"""
SELECT *
FROM read_parquet({[fpath_weather]})
"""
df_weather = run_sql_query(query).convert_dtypes()
with pd.option_context('display.max_columns', None):
pu.show_df(df_weather)
| column | time | tavg | tmin | tmax | prcp | snow | wdir | wspd | wpgt | pres | tsun | station_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| dtype | datetime64[us] | Float64 | Float64 | Float64 | Float64 | Int64 | Int64 | Float64 | Int64 | Float64 | Int64 | string[python] |
| nunique | 1915 | 416 | 404 | 440 | 150 | 29 | 355 | 286 | 59 | 347 | 0 | 1 |
| missing | 0 | 0 | 0 | 0 | 7 | 1448 | 75 | 9 | 721 | 15 | 1915 | 0 |
| 0 | 2018-01-01 | -15.0 | -21.3 | -8.7 | 0.0 | 110 | <NA> | 19.5 | <NA> | 1029.1 | <NA> | 71624 |
| 1 | 2018-01-02 | -10.5 | -13.1 | -7.8 | 1.0 | 110 | 241 | 25.6 | <NA> | 1025.0 | <NA> | 71624 |
| 2 | 2018-01-03 | -9.9 | -13.5 | -6.3 | 0.0 | 100 | 235 | 23.2 | <NA> | 1017.4 | <NA> | 71624 |
| 3 | 2018-01-04 | -14.7 | -20.5 | -8.9 | 0.0 | 90 | 314 | 27.5 | <NA> | 1013.8 | <NA> | 71624 |
| 4 | 2018-01-05 | -19.0 | -23.0 | -15.0 | 0.0 | 80 | 311 | 26.7 | <NA> | 1019.1 | <NA> | 71624 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1910 | 2023-03-27 | 2.9 | 2.0 | 5.0 | 4.9 | <NA> | 55 | 9.4 | <NA> | 1018.0 | <NA> | 71624 |
| 1911 | 2023-03-28 | 3.5 | 1.0 | 6.0 | 0.1 | <NA> | 234 | 10.9 | <NA> | 1021.9 | <NA> | 71624 |
| 1912 | 2023-03-29 | 1.5 | -3.9 | 8.0 | 1.6 | <NA> | 250 | 24.6 | <NA> | 1018.4 | <NA> | 71624 |
| 1913 | 2023-03-30 | -1.3 | -6.5 | 4.0 | 0.0 | <NA> | 251 | 14.0 | <NA> | 1026.8 | <NA> | 71624 |
| 1914 | 2023-03-31 | 2.4 | 0.0 | 4.4 | 8.3 | <NA> | 108 | 9.9 | <NA> | 1015.0 | <NA> | 71624 |
1915 rows × 12 columns
CPU times: user 23.1 ms, sys: 0 ns, total: 23.1 ms Wall time: 22.7 ms
Downtown Neighbourhoods¶
Show previously retrieved neighbourhoods within downtown Toronto
df_downtown_neighs = pd.read_parquet(fpath_downtown_neighs)
pu.show_df(df_downtown_neighs)
| column | Neighbourhood | Location | is_downtown |
|---|---|---|---|
| dtype | string[python] | string[python] | boolean |
| nunique | 24 | 3 | 2 |
| missing | 0 | 0 | 0 |
| 0 | University | Downtown | True |
| 1 | Kensington-Chinatown | Downtown | True |
| 2 | Wellington Place | Downtown | True |
| 3 | Harbourfront-CityPlace | Downtown | True |
| 4 | Bay-Cloverhill | Downtown | True |
| 5 | Yonge-Bay Corridor | Downtown | True |
| 6 | St Lawrence-East Bayfront-The Islands | Downtown | True |
| 7 | Church-Wellesley | Downtown | True |
| 8 | Downtown Yonge East | Downtown | True |
| 9 | North St.James Town | Downtown | True |
| 10 | Cabbagetown-South St.James Town | Downtown | True |
| 11 | Moss Park | Downtown | True |
| 12 | Regent Park | Downtown | True |
| 13 | Roncesvalles | West of Downtown | False |
| 14 | South Parkdale | West of Downtown | False |
| 15 | Dufferin Grove | West of Downtown | False |
| 16 | Little Portugal | West of Downtown | False |
| 17 | Palmerston-Little Italy | West of Downtown | False |
| 18 | Trinity-Bellwoods | West of Downtown | False |
| 19 | West Queen West | West of Downtown | False |
| 20 | Fort York-Liberty Village | West of Downtown | False |
| 21 | North Riverdale | East of Downtown | False |
| 22 | South Riverdale | East of Downtown | False |
| 23 | Blake-Jones | East of Downtown | False |
Bike Share Station Info (MetaData)¶
Show the stations info data that was retrieved previously, containing station name and its associated neighbourhood name
%%time
query = f"""
SELECT station_id,
name,
physical_configuration,
capacity,
is_charging_station,
rental_methods LIKE '%CREDITCARD%' AS credit,
Neighbourhood,
COALESCE(Location, NULL, 'Others') AS Location,
COALESCE(is_downtown, NULL, False) AS is_downtown,
census_tract_id
FROM read_parquet({[fpath_stations_info]})
LEFT JOIN df_downtown_neighs USING (Neighbourhood)
-- WHERE physical_configuration <> 'VAULT'
ORDER BY station_id, name
"""
df_info = run_sql_query(query).convert_dtypes()
with pd.option_context('display.max_columns', None):
pu.show_df(df_info)
| column | station_id | name | physical_configuration | capacity | is_charging_station | credit | Neighbourhood | Location | is_downtown | census_tract_id |
|---|---|---|---|---|---|---|---|---|---|---|
| dtype | string[python] | string[python] | string[python] | Int64 | boolean | boolean | string[python] | string[python] | boolean | string[python] |
| nunique | 790 | 790 | 6 | 40 | 2 | 2 | 107 | 4 | 2 | 272 |
| missing | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 7000 | Fort York Blvd / Capreol Ct | REGULAR | 35 | False | True | Harbourfront-CityPlace | Downtown | True | 5350012.01 |
| 1 | 7001 | Wellesley Station Green P | ELECTRICBIKESTATION | 23 | True | True | Church-Wellesley | Downtown | True | 5350063.06 |
| 2 | 7002 | St. George St / Bloor St W | REGULAR | 19 | False | True | University | Downtown | True | 5350061.00 |
| 3 | 7003 | Madison Ave / Bloor St W | REGULAR | 15 | False | True | Annex | Others | False | 5350091.01 |
| 4 | 7005 | King St W / York St | REGULAR | 23 | False | True | Yonge-Bay Corridor | Downtown | True | 5350014.00 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 785 | 7926 | McRae Dr / Laird Dr - SMART | SMARTMAPFRAME | 24 | False | False | Leaside-Bennington | Others | False | 5350195.02 |
| 786 | 7927 | Strachan Ave / East Liberty St - SMART | SMARTMAPFRAME | 24 | False | False | Fort York-Liberty Village | West of Downtown | False | 5350008.01 |
| 787 | 7928 | Simcoe St / Pullan Pl | REGULAR | 31 | False | True | Kensington-Chinatown | Downtown | True | 5350036.00 |
| 788 | 7929 | Spadina Ave / Bulwer St- SMART | SMARTMAPFRAME | 12 | False | False | Kensington-Chinatown | Downtown | True | 5350039.00 |
| 789 | 7930 | Dunn Ave / Queen St W | REGULAR | 15 | False | True | South Parkdale | West of Downtown | False | 5350005.00 |
790 rows × 10 columns
CPU times: user 23.3 ms, sys: 0 ns, total: 23.3 ms Wall time: 20.4 ms
Bike Share Network Expansion Plans¶
Show previously retrieved bike share network expansion plans
%%time
query = f"""
SELECT *,
NULL AS frac_ctracts_with_bikeshare
FROM read_parquet({[fpath_expansion]})
"""
df_network_size = run_sql_query(query).convert_dtypes()
with pd.option_context('display.max_columns', None):
pu.show_df(df_network_size)
| column | year | trips | num_stations | num_bikes | frac_neighs_with_bikeshare | frac_ctracts_with_bikeshare |
|---|---|---|---|---|---|---|
| dtype | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 |
| nunique | 3 | 3 | 3 | 3 | 0 | 0 |
| missing | 0 | 0 | 0 | 0 | 3 | 3 |
| 0 | 2023 | 5800000 | 820 | 8110 | <NA> | <NA> |
| 1 | 2024 | 7000000 | 930 | 9055 | <NA> | <NA> |
| 2 | 2025 | 8200000 | 1045 | 10000 | <NA> | <NA> |
CPU times: user 9.5 ms, sys: 1.12 ms, total: 10.6 ms Wall time: 9.94 ms
Public Holidays¶
Show the previously retrieved list of public holidays in Canada
%%time
query = f"""
SELECT *,
(
CASE WHEN DAYNAME(date) IN ('Sunday', 'Saturday')
THEN 'Monday'
ELSE DAYNAME(date)
END
) AS weekday
FROM read_parquet({[fpath_holidays]})
"""
df_holidays = run_sql_query(query).convert_dtypes()
with pd.option_context('display.max_columns', None):
pu.show_df(df_holidays)
| column | date | holiday_name | is_holiday | weekday |
|---|---|---|---|---|
| dtype | datetime64[us] | string[python] | boolean | string[python] |
| nunique | 48 | 13 | 1 | 5 |
| missing | 0 | 0 | 0 | 0 |
| 0 | 2018-01-01 | New Year's Day | True | Monday |
| 1 | 2018-03-30 | Good Friday | True | Friday |
| 2 | 2018-07-01 | Canada Day | True | Monday |
| 3 | 2018-09-03 | Labor Day | True | Monday |
| 4 | 2018-12-25 | Christmas Day | True | Tuesday |
| 5 | 2018-02-19 | Family Day | True | Monday |
| 6 | 2018-05-21 | Victoria Day | True | Monday |
| 7 | 2018-10-08 | Thanksgiving Day | True | Monday |
| 8 | 2018-12-26 | Boxing Day | True | Wednesday |
| 9 | 2019-01-01 | New Year's Day | True | Tuesday |
| 10 | 2019-04-19 | Good Friday | True | Friday |
| 11 | 2019-07-01 | Canada Day | True | Monday |
| 12 | 2019-09-02 | Labor Day | True | Monday |
| 13 | 2019-12-25 | Christmas Day | True | Wednesday |
| 14 | 2019-02-18 | Family Day | True | Monday |
| 15 | 2019-05-20 | Victoria Day | True | Monday |
| 16 | 2019-10-14 | Thanksgiving Day | True | Monday |
| 17 | 2019-12-26 | Boxing Day | True | Thursday |
| 18 | 2020-01-01 | New Year's Day | True | Wednesday |
| 19 | 2020-04-10 | Good Friday | True | Friday |
| 20 | 2020-07-01 | Canada Day | True | Wednesday |
| 21 | 2020-09-07 | Labor Day | True | Monday |
| 22 | 2020-12-25 | Christmas Day | True | Friday |
| 23 | 2020-02-17 | Family Day | True | Monday |
| 24 | 2020-05-18 | Victoria Day | True | Monday |
| 25 | 2020-10-12 | Thanksgiving Day | True | Monday |
| 26 | 2020-12-26 | Boxing Day | True | Monday |
| 27 | 2021-01-01 | New Year's Day | True | Friday |
| 28 | 2021-04-02 | Good Friday | True | Friday |
| 29 | 2021-07-01 | Canada Day | True | Thursday |
| 30 | 2021-09-06 | Labor Day | True | Monday |
| 31 | 2021-12-25 | Christmas Day | True | Monday |
| 32 | 2021-12-27 | Christmas Day (observed) | True | Monday |
| 33 | 2021-02-15 | Family Day | True | Monday |
| 34 | 2021-05-24 | Victoria Day | True | Monday |
| 35 | 2021-10-11 | Thanksgiving Day | True | Monday |
| 36 | 2021-12-26 | Boxing Day | True | Monday |
| 37 | 2021-12-28 | Boxing Day (observed) | True | Tuesday |
| 38 | 2022-01-01 | New Year's Day | True | Monday |
| 39 | 2022-01-03 | New Year's Day (observed) | True | Monday |
| 40 | 2022-04-15 | Good Friday | True | Friday |
| 41 | 2022-07-01 | Canada Day | True | Friday |
| 42 | 2022-09-05 | Labor Day | True | Monday |
| 43 | 2022-12-25 | Christmas Day | True | Monday |
| 44 | 2022-12-26 | Boxing Day; Christmas Day (observed) | True | Monday |
| 45 | 2022-02-21 | Family Day | True | Monday |
| 46 | 2022-05-23 | Victoria Day | True | Monday |
| 47 | 2022-10-10 | Thanksgiving Day | True | Monday |
CPU times: user 6.96 ms, sys: 3.23 ms, total: 10.2 ms Wall time: 9.74 ms
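The DAYNAME CASE expression above folds weekend holidays into Monday. A pandas sketch of the same mapping, using two dates from the table (2022-01-01 fell on a Saturday, 2022-04-15 on a Friday):

```python
import pandas as pd

# Map each holiday date to its weekday name, then fold Saturday/Sunday
# into Monday, mirroring the SQL CASE expression above.
dates = pd.Series(pd.to_datetime(["2022-01-01", "2022-04-15"]))
weekday = dates.dt.day_name()
weekday = weekday.where(~weekday.isin(["Saturday", "Sunday"]), "Monday")
print(weekday.tolist())  # ['Monday', 'Friday']
```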
Top-Performing Stations¶
Show stations previously classified as top performers or not
df_stations = pd.read_parquet(fpath_top_stations)
pu.show_df(df_stations)
| column | station_id | name | physical_configuration | capacity | is_charging_station | credit | Neighbourhood | Location | census_tract_id | is_active | ... | is_top_perform_station_weekday | departures_weekend_last_year | arrivals_weekend_last_year | departures_weekend_last_n_years | arrivals_weekend_last_n_years | rank_weekend_deps_last_year | rank_weekend_deps_last_n_years | rank_weekend_arrs_last_year | rank_weekend_arrs_last_n_years | is_top_perform_station_weekend |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| dtype | Int64 | string[python] | string[python] | Int64 | boolean | boolean | string[python] | string[python] | string[python] | boolean | ... | boolean | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | boolean |
| nunique | 627 | 627 | 6 | 39 | 2 | 2 | 83 | 4 | 214 | 1 | ... | 2 | 570 | 576 | 601 | 598 | 570 | 601 | 576 | 598 | 2 |
| missing | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 7076 | York St / Queens Quay W | REGULAR | 57 | False | True | Harbourfront-CityPlace | Downtown | 5350012.04 | True | ... | True | 14168 | 16020 | 49454 | 54606 | 1 | 2 | 1 | 1 | True |
| 1 | 7016 | Bay St / Queens Quay W (Ferry Terminal) | REGULAR | 35 | False | True | St Lawrence-East Bayfront-The Islands | Downtown | 5350013.02 | True | ... | True | 10548 | 10835 | 40142 | 42497 | 5 | 5 | 5 | 5 | True |
| 2 | 7033 | Union Station | REGULAR | 43 | False | True | St Lawrence-East Bayfront-The Islands | Downtown | 5350013.01 | True | ... | True | 6026 | 7354 | 17353 | 20216 | 35 | 46 | 20 | 33 | True |
| 3 | 7175 | HTO Park (Queens Quay W) | REGULAR | 27 | False | True | Harbourfront-CityPlace | Downtown | 5350012.04 | True | ... | True | 11206 | 13941 | 41840 | 47634 | 3 | 4 | 2 | 3 | True |
| 4 | 7203 | Bathurst St/Queens Quay(Billy Bishop Airport) | REGULAR | 35 | False | True | Fort York-Liberty Village | West of Downtown | 5350008.02 | True | ... | True | 8970 | 10015 | 38856 | 41360 | 8 | 6 | 8 | 6 | True |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 622 | 7156 | Salem Ave / Bloor St W | REGULAR | 15 | False | True | Dovercourt Village | Others | 5350096.02 | True | ... | False | 1655 | 1574 | 6196 | 5856 | 276 | 232 | 282 | 246 | False |
| 623 | 7600 | Ursula Franklin St / Huron St - SMART | SMARTMAPFRAME | 20 | False | False | University | Downtown | 5350061.00 | True | ... | False | 953 | 740 | 1605 | 1254 | 390 | 467 | 423 | 486 | False |
| 624 | 7414 | Keele St / Annette St | REGULAR | 15 | False | True | Junction Area | Others | 5350101.00 | True | ... | False | 694 | 687 | 2255 | 2283 | 442 | 412 | 431 | 401 | False |
| 625 | 7622 | Marie Curtis Park | REGULAR | 23 | False | True | Long Branch | Others | 5350206.01 | True | ... | False | 741 | 756 | 2248 | 2347 | 434 | 413 | 420 | 398 | False |
| 626 | 7694 | Victoria Park Ave / Dawes Rd | REGULAR | 19 | False | True | O'Connor-Parkview | Others | 5350190.02 | True | ... | False | 133 | 130 | 133 | 130 | 583 | 594 | 584 | 595 | False |
627 rows × 38 columns
Processed Bike Share Ridership Data¶
Show the first three rows of the file of processed bike share ridership for August of 2022
%%time
query = f"""
SELECT *
FROM read_parquet({[fpaths_proc[2022][7]]})
WHERE started_at_year = 2022
AND started_at_month = 8
LIMIT 3
"""
df_proc_trips_preview = run_sql_query(query).convert_dtypes()
pu.show_df(df_proc_trips_preview)
| column | trip_id | start_station_id | started_at | start_station_name | end_station_id | ended_at | end_station_name | bike_id | user_type | started_at_year | started_at_month | started_at_day | started_at_hour | started_at_minute | ended_at_year | ended_at_month | ended_at_day | ended_at_hour | ended_at_minute |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| dtype | Int64 | Int32 | datetime64[us] | string[python] | Int64 | datetime64[us] | string[python] | Int64 | string[python] | Int32 | Int32 | Int32 | Int32 | Int32 | Int32 | Int32 | Int32 | Int32 | Int32 |
| nunique | 3 | 2 | 1 | 2 | 2 | 2 | 1 | 3 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 2 |
| missing | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 17515458 | 7259 | 2022-08-01 | Lower Spadina Ave / Lake Shore Blvd | 7712 | 2022-08-01 00:20:00 | <NA> | 3328 | Casual Member | 2022 | 8 | 1 | 0 | 0 | 2022 | 8 | 1 | 0 | 20 |
| 1 | 17515440 | 7208 | 2022-08-01 | 80 Clinton St North of College | 7534 | 2022-08-01 00:07:00 | Walnut Ave / Queen St W | 4662 | Casual Member | 2022 | 8 | 1 | 0 | 0 | 2022 | 8 | 1 | 0 | 7 |
| 2 | 17515442 | 7259 | 2022-08-01 | Lower Spadina Ave / Lake Shore Blvd | 7712 | 2022-08-01 00:20:00 | <NA> | 4510 | Casual Member | 2022 | 8 | 1 | 0 | 0 | 2022 | 8 | 1 | 0 | 20 |
CPU times: user 48.9 ms, sys: 558 µs, total: 49.4 ms Wall time: 48.7 ms
Notes
- The above contents come from the file containing processed bike share ridership for August of 2022. The files with processed data for all other months (2020, 2021, 2022 and 2023) and quarters (2018, 2019) contain the same column names.
Show Overall Network Performance¶
Get a high-level summary of historical station performance and future plans by calculating the following metrics
- total ridership (number of trips) across entire network
- total number of bikes used in ridership across entire network
- total number of stations used in ridership across entire network
- fraction of all neighbourhoods with at least one bike share station
using the following approach
- get stations and bikes used in yearly departures, along with the number of departures (used as trips), per station between 2018 and 2022 (1)
- get stations and bikes used in yearly arrivals per station between 2018 and 2022 (2)
- get yearly totals for departures (3)
- get yearly totals for arrivals (4)
- combine the yearly totals for departures (3) and arrivals (4) using UNION ALL (5)
- from the combined totals (5), get yearly totals for
- number of trips taken (departures)
- bikes used in historical bike share trips
- stations used in historical bike share trips
- combine the overall historical totals (6) with the network's future expansion plans (7)
- add columns marking the years covering the start and end of the future expansion plans (8)
%%time
query = f"""
-- 1. get pre-2023 yearly departure-related totals (up to Dec 31, 2022) per station
WITH t1 AS (
SELECT start_station_id AS station_id,
started_at_year AS year,
COUNT(DISTINCT(trip_id)) AS trips,
COUNT(DISTINCT(bike_id)) AS bikes_deps
FROM read_parquet({fpaths_proc_all})
WHERE started_at_year <= 2022
-- (OPTIONAL) exclude trips missing a start station name
-- WHERE start_station_name IS NOT NULL
GROUP BY all
),
-- 2. get pre-2023 yearly arrival-related totals (up to Dec 31, 2022) per station
t2 AS (
SELECT end_station_id AS station_id,
ended_at_year AS year,
COUNT(DISTINCT(bike_id)) AS bikes_arrs
FROM read_parquet({fpaths_proc_all})
WHERE started_at_year <= 2022
-- (OPTIONAL) exclude trips missing an end station name
-- WHERE end_station_name IS NOT NULL
GROUP BY all
),
-- 3. Aggregate pre-2023 yearly departures to get the following
-- count the stations with departures
-- count the number of bikes used for departures
-- sum the number of departures (trips)
t3 AS (
SELECT year,
COUNT(distinct(station_id)) AS num_stations,
MAX(bikes_deps) AS bikes,
SUM(trips) AS trips,
'departures' AS type
FROM t1
GROUP BY year
),
-- 4. Aggregate pre-2023 yearly arrivals to get the following
-- count the stations with arrivals
-- count the number of bikes used for arrivals
t4 AS (
SELECT year,
COUNT(distinct(station_id)) AS num_stations,
MAX(bikes_arrs) AS bikes,
NULL AS trips,
'arrivals' AS type
FROM t2
GROUP BY year
),
-- 5. combine pre-2023 counts for departures and arrivals
t5 AS (
SELECT *
FROM t3
UNION ALL
SELECT *
FROM t4
),
-- 6. get overall yearly totals for trips, stations, and bikes used in bike share trips
t6 AS (
SELECT year,
SUM(trips) AS trips,
MAX(num_stations) AS num_stations,
MAX(bikes) AS num_bikes
FROM t5
GROUP BY ALL
ORDER BY year
),
-- 7. combine the yearly totals with the network expansion plans starting in 2023
t7 AS (
SELECT *
FROM t6
WHERE year < 2023
UNION
SELECT year,
trips,
num_stations,
num_bikes
FROM df_network_size
WHERE year >= 2023
ORDER BY year
),
-- 8. add indicator of future expansion plans
t8 AS (
SELECT *,
2023 AS shading_start,
2025 AS shading_stop
FROM t7
)
SELECT *
FROM t8
"""
df_yearly_summary = run_sql_query(query).convert_dtypes()
pu.show_df(df_yearly_summary)
| column | year | trips | num_stations | num_bikes | shading_start | shading_stop |
|---|---|---|---|---|---|---|
| dtype | Int64 | Int64 | Int64 | Int64 | Int32 | Int32 |
| nunique | 8 | 8 | 8 | 8 | 1 | 1 |
| missing | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 2018 | 1830573 | 359 | 0 | 2023 | 2025 |
| 1 | 2019 | 2326479 | 469 | 4787 | 2023 | 2025 |
| 2 | 2020 | 2739833 | 611 | 6417 | 2023 | 2025 |
| 3 | 2021 | 3435503 | 628 | 6120 | 2023 | 2025 |
| 4 | 2022 | 4514466 | 684 | 6523 | 2023 | 2025 |
| 5 | 2023 | 5800000 | 820 | 8110 | 2023 | 2025 |
| 6 | 2024 | 7000000 | 930 | 9055 | 2023 | 2025 |
| 7 | 2025 | 8200000 | 1045 | 10000 | 2023 | 2025 |
CPU times: user 21.7 s, sys: 1.35 s, total: 23.1 s Wall time: 2.36 s
Notes
- In order to count the number of pre-2023 stations, the aggregations count the station ID columns (start_station_id or end_station_id) rather than the station name columns (start_station_name or end_station_name). Dropping trips that are missing a start or end station name would create a difference between the analyzed and published yearly totals for the number of stations used. At this stage in the analysis, this SQL query only creates a high-level overview of pre-2023 system performance using the raw data and the planned expansion (starting from 2023). Similarly, counting the number of bikes does not require the station name column, since only the bike_id column is needed; here too, dropping trips missing a station name would create a difference relative to the published totals for the number of bikes used. For these two reasons, trips missing a station name are kept in these high-level totals.
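The effect described above can be shown on a toy frame (hypothetical values): counting distinct station IDs keeps a station whose name is missing, while filtering on the name column first would drop it.

```python
import pandas as pd

# Toy trips: station 7001 appears only with a missing name.
trips = pd.DataFrame({
    "start_station_id": [7000, 7001, 7001],
    "start_station_name": ["Fort York Blvd / Capreol Ct", None, None],
})
n_by_id = trips["start_station_id"].nunique()
n_by_name = trips.dropna(subset=["start_station_name"])["start_station_id"].nunique()
print(n_by_id, n_by_name)  # 2 1
```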
Show charts of the overall
- trips
- number of stations and bikes used in historical ridership and planned to be available during the network expansion period
%%time
chart, charts_dict = vzu.plot_line_chart_grid(
df=df_yearly_summary,
xvar='year:O',
yvars=["trips:Q", 'num_stations:Q', 'num_bikes:Q'],
xvar_areas="shading_start:O",
xvar_areas2="shading_stop:O",
color_by_col_areas = 'index:N',
areas_opacity = 0.1,
axis_label_color = '#636363',
axis_tick_label_color = 'grey',
annotation_text_color = 'grey',
ptitles={
'trips': "Rapidly growing ridership with nearly 5M trips in 2022",
'num_stations': 'Strong yearly expansion of network footprint to resume in 2023',
'num_bikes': 'Yearly expansion of bicycle inventory to restart in 2023',
},
ptitles_x_locs={'trips': 45, 'num_stations': 65, 'num_bikes': 70},
ytitles={
'trips': 'Trips', 'num_stations': 'Stations', 'num_bikes': 'Bicycles'
},
line_color = 'darkgreen',
marker_size = 60,
marker_fill = "white",
marker_edge_color = 'darkgreen',
annotation_text_y_loc = 140,
annotation_text_opacity = 0.85,
concat_spacing=10,
axis_label_opacity=0.7,
axis_label_fontsize=14,
fig_size=dict(width=675, height=300),
)
chart.save(os.path.join(figures_dir, '01_high_level_summary.png'))
for k,v in charts_dict.items():
chart = vzu.configure_chart(charts_dict[k], axis_label_fontsize)
chart.save(os.path.join(figures_dir, f'01_high_level_summary__{k}.png'))
CPU times: user 890 ms, sys: 25.5 ms, total: 916 ms Wall time: 812 ms
%%time
chart = vzu.plot_line_charts_with_shaded_area(
pd.concat(
[
df_yearly_summary.query("year <= 2022"),
df_yearly_summary.query("year >= 2023").drop(columns=['trips'])
]
),
xvar="year:O",
yvar="trips:Q",
yvar2="num_stations:Q",
xvar_areas="shading_start:O",
xvar_areas2="shading_stop:O",
color_by_col_areas = 'index:N',
y_max = 9_000_000,
xtitle = 'Trips (solid)',
ytitle = 'Stations (dashed)',
axis_label_fontsize = axis_label_fontsize,
ptitle = alt.TitleParams(
"Nearly 5M trips in 2022 and Expanding the Network's Footprint",
anchor='start',
dx=45,
fontSize=axis_label_fontsize
),
annotation_text = 'Four-Year Growth Plan',
annotation_text_color = 'grey',
annotation_text_opacity = 0.85,
annotation_text_x_loc = 160,
annotation_text_y_loc = 125,
annotation_text_solid=dict(dx=175, dy=75),
annotation_text_dashed=dict(dx=330, dy=-40),
areas_opacity = 0.1,
axis_label_color = 'darkgreen',
axis_tick_label_color = 'darkgreen',
axis2_label_color = 'green',
axis2_tick_label_color = 'green',
line_color = 'darkgreen',
line_color2 = '#a1d99b',
marker_size = 60,
marker_fill = "white",
marker_edge_color = 'darkgreen',
marker_edge_color2 = '#a1d99b',
fig_size = dict(width=675, height=300),
)
chart
CPU times: user 43.5 ms, sys: 3 µs, total: 43.5 ms Wall time: 43.1 ms
Observations
- Network expansion, in terms of bikes and stations added, was minimal during both 2021 and 2022, so the slope of the dashed line is relatively small between 2020 and 2021 and between 2021 and 2022. However, the future expansion plans resemble pre-2020 growth.
- Regardless of expansion, bike share ridership continues to show strong year-over-year growth.
- Overall, the network's usage and footprint are rapidly growing.
%%time
chart = vzu.plot_line_charts_with_shaded_area(
df_yearly_summary.query('year != 2018'),
xvar="year:O",
yvar="num_stations:Q",
yvar2="num_bikes:Q",
xvar_areas="shading_start:O",
xvar_areas2="shading_stop:O",
color_by_col_areas = 'index:N',
y_max = 1_200,
xtitle = 'Stations (solid)',
ytitle = 'Bicycles (dashed)',
axis_label_fontsize = axis_label_fontsize,
ptitle = alt.TitleParams(
"Located in Half the City's Neighbourhoods and Expanding Bicycle Inventory",
anchor='start',
dx=65,
fontSize=axis_label_fontsize
),
annotation_text = 'Four-Year Growth Plan',
annotation_text_color = 'grey',
annotation_text_opacity = 0.85,
annotation_text_x_loc = 170,
annotation_text_y_loc = 125,
annotation_text_solid=dict(dx=200, dy=5),
annotation_text_dashed=dict(dx=280, dy=-50),
areas_opacity = 0.1,
axis_label_color = 'darkgreen',
axis_tick_label_color = 'darkgreen',
axis2_label_color = 'green',
axis2_tick_label_color = 'green',
line_color = 'darkgreen',
line_color2 = '#a1d99b',
marker_size = 60,
marker_fill = "white",
marker_edge_color = 'darkgreen',
marker_edge_color2 = '#a1d99b',
fig_size = dict(width=675, height=300),
)
chart
CPU times: user 39.1 ms, sys: 4.32 ms, total: 43.5 ms Wall time: 42.8 ms
Observations
- The slowdown in network expansion during 2021 and 2022 is also visible in the number of bikes available in the network each year. This chart shows the number of bikes used in bike share trips across the network; not all available bikes may appear in the ridership data after the processing performed in the previous step. For this reason, the number of bikes used shows a drop in 2021.
- Although the number of neighbourhoods served by the service is flat during the three years from 2020 to 2022, expansion is planned to target new neighbourhoods (not shown in the chart).
All charts shown above display totals based on the bike share ridership data processed in the previous step. These totals will differ from totals published elsewhere due to differences in data processing.
Summary of Metrics Used
- total ridership (number of trips) across entire network
- total number of bikes used in ridership across entire network
- total number of stations used in ridership across entire network
Explore Attributes of Stations by Station Type¶
The station types are
- Top-Performing stations
- Other stations (excluding top-performers)
Get the Fraction of Stations Located in Downtown Toronto¶
- From the ranked stations, count number of stations by station type and location type (whether station is located downtown or not)
- reshape into untidy data
- calculate fraction of top-performing stations that are located downtown
%%time
query = f"""
WITH t1 AS (
-- 1. count number of stations by station type and location type
SELECT -- change the top-performing station column from a boolean to a string
(
CASE WHEN is_top_perform_station = True
THEN 'Top Performer'
ELSE 'Others' END
) AS station_type,
-- append conditional to indicate if station is located in a downtown neighbourhood
(
CASE WHEN is_downtown = True
THEN 'Downtown'
ELSE 'Other_Locations' END
) AS location_type,
-- count the number of stations that are and are not located in a downtown
-- neighbourhood
COUNT(DISTINCT(station_id)) AS num_stations
FROM df_stations
GROUP BY ALL
),
-- 2. reshape using pivot table
t2 AS (
PIVOT t1
ON location_type
USING sum(num_stations)
)
-- 3. calculate fraction of stations that are and are not located in downtown neighbourhoods
SELECT *,
Downtown+Other_Locations AS total,
100*Downtown/total AS frac_downtown
FROM t2
"""
df_downtown_query = run_sql_query(query).convert_dtypes()
df_downtown_query
CPU times: user 29 ms, sys: 0 ns, total: 29 ms Wall time: 17.3 ms
| station_type | Downtown | Other_Locations | total | frac_downtown | |
|---|---|---|---|---|---|
| 0 | Others | 137 | 392 | 529 | 25.897921 |
| 1 | Top Performer | 76 | 22 | 98 | 77.55102 |
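For reference, a pandas sketch of the same pivot-then-divide computation, seeded with the counts from the table above:

```python
import pandas as pd

# Long-format counts matching the t1 CTE above.
counts = pd.DataFrame({
    "station_type": ["Top Performer", "Top Performer", "Others", "Others"],
    "location_type": ["Downtown", "Other_Locations", "Downtown", "Other_Locations"],
    "num_stations": [76, 22, 137, 392],
})
# Pivot to wide format (like the DuckDB PIVOT), then compute the downtown share.
wide = counts.pivot_table(
    index="station_type", columns="location_type",
    values="num_stations", aggfunc="sum",
)
wide["frac_downtown"] = 100 * wide["Downtown"] / (wide["Downtown"] + wide["Other_Locations"])
print(round(float(wide.loc["Top Performer", "frac_downtown"]), 1))  # 77.6
```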
Show above as a chart
%%time
chart = vzu.plot_bar_chart(
df_downtown_query,
xvar='frac_downtown:Q',
yvar='station_type:N',
xtitle='Downtown Stations (%)',
y_axis_sort=['Top Performer', 'Others'],
color_by_col='station_type:N',
color_labels=['Top Performer', 'Others'],
color_values=['darkgreen', 'lightgrey'],
ptitle=alt.TitleParams(
'Nearly 3X Top-Performers are Downtown as Others',
anchor='start',
dx=100,
fontSize=axis_label_fontsize
),
axis_label_fontsize=axis_label_fontsize,
fig_size=dict(width=450, height=125),
)
chart.save(
os.path.join(figures_dir, '02_top_performers_downtown.png')
)
chart
CPU times: user 65.2 ms, sys: 6.22 ms, total: 71.4 ms Wall time: 41.2 ms
Observations
- Nearly 80% of the top-performing stations are located downtown. Most bike share systems start by setting up the majority of their stations in the downtown or financial district of the city, and the same is true for Bike Share Toronto. Downtown Toronto contains a high density of office buildings as well as the city's financial district. With the majority of its top-performing stations located downtown as of the end of 2022, the service continues to have strong support among working professionals.
Summary of Metrics Used
- fraction of bike share stations located in downtown Toronto
Get the Fraction of Stations Located In or Near Downtown Neighbourhoods¶
- From the ranked stations, count number of stations by station type and neighbourhood, for neighbourhoods close to Downtown Toronto
- reshape into untidy data
- calculate fraction of top-performing stations by neighbourhood
- reshape into tidy data
%%time
query = f"""
WITH t1 AS (
-- 1. count number of stations by station and neighbourhood
SELECT -- change the top-performing station column from a boolean to a string
(
CASE WHEN is_top_perform_station = True
THEN 'Top Performer'
ELSE 'Others' END
) AS station_type,
-- append conditional to indicate if station is located in a downtown neighbourhood
REPLACE(location, ' ', '_') AS location,
-- count the number of stations that are and are not located in a downtown
-- neighbourhood
COUNT(DISTINCT(station_id)) AS num_stations
FROM df_stations
GROUP BY ALL
),
-- 2. reshape using pivot table
t2 AS (
PIVOT t1
ON location
USING sum(num_stations)
),
-- 3. calculate fraction of stations by neighbourhood
t3 AS (
SELECT *,
Downtown+East_of_Downtown+Others+West_of_Downtown AS total,
100*Downtown/total AS frac_Downtown,
100*East_of_Downtown/total AS frac_East_of_Downtown,
100*West_of_Downtown/total AS frac_West_of_Downtown,
100*Others/total AS frac_Other_Neighbourhoods
FROM t2
),
-- 4. reshape into tidy data
t4 AS (
UNPIVOT (
SELECT station_type,
frac_downtown,
frac_East_of_Downtown,
frac_West_of_Downtown,
frac_Other_Neighbourhoods
FROM t3
)
ON frac_downtown,
frac_East_of_Downtown,
frac_West_of_Downtown,
frac_Other_Neighbourhoods
INTO NAME variable VALUE value
)
SELECT * EXCLUDE(variable),
REPLACE(
REPLACE(variable, 'frac_', ''),
'_',
' '
) AS variable
FROM t4
"""
df_downtown_adjacent_query = run_sql_query(query).convert_dtypes()
df_downtown_adjacent_query
CPU times: user 46.6 ms, sys: 0 ns, total: 46.6 ms Wall time: 31.5 ms
| station_type | value | variable | |
|---|---|---|---|
| 0 | Others | 25.897921 | Downtown |
| 1 | Others | 5.860113 | East of Downtown |
| 2 | Others | 14.555766 | West of Downtown |
| 3 | Others | 53.6862 | Other Neighbourhoods |
| 4 | Top Performer | 77.55102 | Downtown |
| 5 | Top Performer | 1.020408 | East of Downtown |
| 6 | Top Performer | 16.326531 | West of Downtown |
| 7 | Top Performer | 5.102041 | Other Neighbourhoods |
Show above as a chart
%%time
chart = vzu.plot_non_grouped_bar_chart_grid(
df=df_downtown_adjacent_query,
rows_labels_colors={
'Top Performer': {
'labels': [
'Downtown',
'West of Downtown',
'Other Neighbourhoods',
'East of Downtown',
],
'colors': [
'darkgreen',
'darkred',
'lightgrey',
'#a1d99b',
],
},
'Others': {
'labels': [
'Other Neighbourhoods',
'Downtown',
'West of Downtown',
'East of Downtown',
],
'colors': [
'lightgrey',
'darkgreen',
'darkred',
'#a1d99b',
],
}
},
xvar='value:Q',
yvar='variable:N',
color_by_col='variable:N',
legend_label_font_color='#636363',
title_text='Downtown and West Outperform Other Neighbourhoods',
xtitle_text='Stations (%)',
title_fontweight='normal',
title_font_color='grey',
axis_label_fontsize=14,
ptitle_xloc=20,
fig_size = dict(width=450, height=125),
)
chart.save(
os.path.join(
figures_dir, '03_top_performers_downtown_neighbourhoods.png'
)
)
chart
CPU times: user 130 ms, sys: 0 ns, total: 130 ms Wall time: 83.4 ms
Observations
- As was seen in the previous chart, downtown neighbourhoods capture nearly 80% of all top-performing stations. The next highest grouping is the neighbourhoods adjacent to Downtown Toronto on its west, which capture nearly 16% of the top-performers. Notably, the neighbourhoods adjacent to the east of downtown capture only about 1% of these top-performing stations, with the remaining roughly 5% spread across other neighbourhoods.
- For stations that are not top-performers, the same ordering is seen, with downtown neighbourhoods ahead of the adjacent western neighbourhoods and, finally, the adjacent eastern neighbourhoods. As expected, more than half of these non-top-performing stations are found outside Downtown Toronto and its adjacent neighbourhoods. By comparison, the downtown neighbourhoods account for approximately one quarter of these stations.
Summary of Metrics Used
- fraction of bike share stations by neighbourhood, for neighbourhoods close to Downtown Toronto
Get the Fraction of Stations Accepting Payment via Credit Card¶
- From the ranked stations, count number of stations by station type and whether the station accepts credit card payments or not
- reshape into untidy data
- calculate fraction of stations that do and do not accept credit card payments when checking out a bike from a dock at the station
%%time
query = f"""
WITH t1 AS (
-- 1. count number of stations by station type and whether they accept credit card
-- payments
SELECT -- change the top-performing station column from a boolean to a string
(
CASE WHEN is_top_perform_station = True
THEN 'Top Performer'
ELSE 'Others' END
) AS station_type,
-- indicate if station accepts payments using a credit card
(
CASE WHEN credit = True
THEN 'Credit_accepted'
ELSE 'Credit_not_accepted' END
) AS takes_credit,
-- count the number of stations that do and do not accept payments using a
-- credit card
COUNT(DISTINCT(station_id)) AS num_stations
FROM df_stations
GROUP BY ALL
),
-- 2. reshape using pivot table
t2 AS (
PIVOT t1
ON takes_credit
USING sum(num_stations)
)
-- 3. calculate fraction of stations that do and do not accept credit card payments
SELECT *,
Credit_accepted+Credit_not_accepted AS total,
100*Credit_accepted/total AS frac_takes_credit
FROM t2
"""
df_payment_methods_query = run_sql_query(query).convert_dtypes()
df_payment_methods_query
CPU times: user 31.9 ms, sys: 0 ns, total: 31.9 ms Wall time: 20 ms
| station_type | Credit_accepted | Credit_not_accepted | total | frac_takes_credit | |
|---|---|---|---|---|---|
| 0 | Others | 419 | 110 | 529 | 79.206049 |
| 1 | Top Performer | 85 | 13 | 98 | 86.734694 |
Show above as a chart
%%time
chart = vzu.plot_bar_chart(
df_payment_methods_query,
xvar='frac_takes_credit:Q',
yvar='station_type:N',
xtitle='Stations (%)',
y_axis_sort=['Top Performer', 'Others'],
color_by_col='station_type:N',
color_labels=['Top Performer', 'Others'],
color_values=['darkgreen', 'lightgrey'],
ptitle=alt.TitleParams(
'Stations Accept Credit Card as a Method of Payment',
anchor='start',
dx=100,
fontSize=axis_label_fontsize
),
axis_label_fontsize=axis_label_fontsize,
fig_size=dict(width=450, height=125),
)
chart.save(
os.path.join(
figures_dir, '04_top_performers_payment_methods.png'
)
)
chart
CPU times: user 36.4 ms, sys: 0 ns, total: 36.4 ms Wall time: 31 ms
Observations
- Almost all stations, whether top-performers or not, accept credit cards as a method of payment. This is a highly convenient way of checking out a bike for most bike share users, so it is reassuring that nearly all stations support it.
Summary of Metrics Used
- fraction of bike share stations that accept payments using a credit card
Get the Fraction of Stations Based on Their Configuration¶
- From the ranked stations, count number of stations by station type and physical configuration
- reshape into untidy data for convenience when calculating fractions
- group the MAPFRAME (kiosk-less) configurations into a single category in order to get four overall categories
  - REGULAR (classic station, the most popular)
  - NO-KIOSK (no kiosk)
  - VAULT
  - ELECTRICBIKESTATION (supports e-bike charging)
- calculate fraction of stations based on their physical configuration (one of REGULAR, NO-KIOSK, VAULT and ELECTRICBIKESTATION)
- reshape into tidy data for plotting
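The grouping of the kiosk-less MAPFRAME variants and the zero-filling of missing counts (the COALESCE step in the query below) can be sketched in pandas. All numbers here are hypothetical; the column names mirror the pivoted query output:

```python
import pandas as pd

# Hypothetical pivoted station counts per configuration (all numbers made up);
# NaN marks a (station type, configuration) pair with no stations, as left by PIVOT
wide = pd.DataFrame(
    {
        "REGULAR": [403.0, 84.0],
        "ELECTRICBIKESTATION": [14.0, 1.0],
        "VAULT": [32.0, None],  # e.g. no VAULT top-performers
        "REGULARLITMAPFRAME": [2.0, 1.0],
        "SMARTMAPFRAME": [77.0, 11.0],
        "SMARTLITMAPFRAME": [1.0, 1.0],
    },
    index=pd.Index(["Others", "Top Performer"], name="station_type"),
)

# Treat missing counts as zero (the COALESCE(..., 0) step in the query)
wide = wide.fillna(0)

# Group the three kiosk-less MAPFRAME variants into a single No_Kiosk category
mapframe_cols = ["REGULARLITMAPFRAME", "SMARTMAPFRAME", "SMARTLITMAPFRAME"]
wide["No_Kiosk"] = wide[mapframe_cols].sum(axis=1)

# Fractions per configuration category, out of all stations of that type
wide["total"] = wide[["REGULAR", "ELECTRICBIKESTATION", "VAULT", "No_Kiosk"]].sum(axis=1)
fracs = 100 * wide[["REGULAR", "No_Kiosk", "VAULT", "ELECTRICBIKESTATION"]].div(
    wide["total"], axis=0
)
print(fracs.round(1))
```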
%%time
query = f"""
WITH t1 AS (
-- 1. count number of stations by station type and configuration
SELECT -- change the top-performing station column from a boolean to a string
(
CASE WHEN is_top_perform_station = True
THEN 'Top Performer'
ELSE 'Others' END
) AS station_type,
-- extract station configuration (REGULAR, *MAPFRAME, ELECTRICBIKESTATION)
physical_configuration,
-- count stations
COUNT(DISTINCT(station_id)) AS num_stations
FROM df_stations
GROUP BY ALL
),
-- 2. reshape into untidy data using pivot table
t2 AS (
PIVOT t1
ON physical_configuration
USING sum(num_stations)
),
-- 3. create a new physical configuration as no-kiosk stations which includes
-- all MAPFRAME configurations combined
t3 AS (
SELECT *,
-- calculate total number of stations in each category
(
COALESCE(ELECTRICBIKESTATION, NULL, 0)
+COALESCE(REGULAR, NULL, 0)
+COALESCE(VAULT, NULL, 0)
+COALESCE(REGULARLITMAPFRAME, NULL, 0)
+COALESCE(SMARTMAPFRAME, NULL, 0)
+COALESCE(SMARTLITMAPFRAME, NULL, 0)
) AS total,
-- calculate total number of stations in the new no-kiosk (*MAPFRAME) category
(
COALESCE(REGULARLITMAPFRAME, NULL, 0)
+COALESCE(SMARTMAPFRAME, NULL, 0)
+COALESCE(SMARTLITMAPFRAME, NULL, 0)
) AS No_Kiosk
FROM t2
),
-- 4. calculate fraction of stations in REGULAR, NO-KIOSK, VAULT & ELECTRICBIKESTATION configs
t4 AS (
SELECT station_type,
100*COALESCE(ELECTRICBIKESTATION, NULL, 0)/total AS Electric,
100*COALESCE(REGULAR, NULL, 0)/total AS Regular,
100*COALESCE(VAULT, NULL, 0)/total AS Vault,
100*No_Kiosk/total AS 'No Kiosk'
FROM t3
),
-- 5. reshape into tidy data using pivot table
t5 AS (
UNPIVOT t4
ON Electric, Regular, Vault, 'No Kiosk'
INTO NAME physical_configuration
VALUE value
)
SELECT *
FROM t5
"""
df_physical_config_query = run_sql_query(query).convert_dtypes()
pu.show_df(df_physical_config_query)
| column | station_type | physical_configuration | value |
|---|---|---|---|
| dtype | string[python] | string[python] | Float64 |
| nunique | 2 | 4 | 8 |
| missing | 0 | 0 | 0 |
| 0 | Others | Electric | 2.816901 |
| 1 | Others | Regular | 81.086519 |
| 2 | Others | Vault | 6.438632 |
| 3 | Others | No Kiosk | 15.492958 |
| 4 | Top Performer | Electric | 1.020408 |
| 5 | Top Performer | Regular | 85.714286 |
| 6 | Top Performer | Vault | 0.0 |
| 7 | Top Performer | No Kiosk | 13.265306 |
CPU times: user 108 ms, sys: 21.2 ms, total: 129 ms Wall time: 96.8 ms
Show above as a chart
%%time
chart = vzu.plot_grouped_bar_chart(
df=df_physical_config_query,
xvar="physical_configuration:N",
yvar="sum(value):Q",
color_by_col="physical_configuration:N",
column_var="station_type:N",
ytitle='Stations (%)',
color_labels=['Regular', 'No Kiosk', 'Vault', 'Electric'],
color_values=['darkgreen', 'lightgrey', 'lightblue', 'lightgreen'],
axis_label_fontsize=14,
ptitle=alt.TitleParams(
'Top-Performing Stations Have Preference for Guidance Kiosk',
anchor='start',
dx=20,
fontSize=axis_label_fontsize,
),
facet_column_spacing=20,
fig_size=dict(width=250, height=300),
)
chart.save(
os.path.join(
figures_dir, '05_top_performers_physical_configurations.png'
)
)
chart
CPU times: user 58.2 ms, sys: 302 µs, total: 58.5 ms Wall time: 50.4 ms
Notes
- Cleaning of smart map frame stations (page 18/86)
- Smart map frame docks do not use the ticket kiosk to check out a bike (page 12/20)
Observations
- Top-performing stations are predominantly found in a Regular configuration. These station configurations require the user to check out or check in a bike using a kiosk. Earlier, it was seen that most of the top-performers are also located downtown, where more stations were placed when the service was initially launched in order to gain the highest possible ridership. It is likely that a regular station configuration, with written instructions on the kiosk to guide the user, has kept users coming back to these stations due to both their ease-of-use and their proximity to office buildings. By comparison, the majority of MAPFRAME stations, which give the option to use the bike dock itself without a kiosk, are likely not located in the areas of heaviest usage. So, these stations can introduce more experimental features, like a kiosk-less check-out process, without negatively affecting the user experience (e.g. due to technical difficulties) at high-demand locations.
Summary of Metrics Used
- fraction of bike share stations within each of the following station configurations
- regular
- e-bike
- kiosk-less
Get the Fraction of Overall and Weekday/Weekend Top-Performers¶
Get the fraction of overall top-performers that are also top-performing stations on weekdays and weekends
def get_station_type(
df: pd.DataFrame, station_type_mapper: Dict[str, str]
) -> pd.DataFrame:
"""Get fraction of simultaneous overall, weekday & weekend top-performers."""
df_station_types = pd.DataFrame.from_records(
[
{
'station_type': label,
"num_stations": len(df.query(query)),
'frac_stations_overall': (
100*len(df.query(query))/len(df.query("is_top_perform_station"))
),
}
for query, label in station_type_mapper.items()
]
)
return df_station_types
station_type_criteria = {
(
"(is_top_perform_station_weekday == True) & "
"(is_top_perform_station_weekend == True) & "
"(is_top_perform_station == True)"
): 'Includes Weekdays & Weekends',
(
'((is_top_perform_station_weekday == False) | '
'(is_top_perform_station_weekend == False)) & '
'(is_top_perform_station == True)'
): 'Excludes Weekdays or Weekends',
}
df_top_perform_overall_intra_week = (
get_station_type(df_stations, station_type_criteria)
.convert_dtypes()
)
pu.show_df(df_top_perform_overall_intra_week)
| column | station_type | num_stations | frac_stations_overall |
|---|---|---|---|
| dtype | string[python] | Int64 | Float64 |
| nunique | 2 | 2 | 2 |
| missing | 0 | 0 | 0 |
| 0 | Includes Weekdays & Weekends | 79 | 80.612245 |
| 1 | Excludes Weekdays or Weekends | 19 | 19.387755 |
Show above as a chart
%%time
chart = vzu.plot_pie_chart(
df_top_perform_overall_intra_week,
yvar = "frac_stations_overall:Q",
color_by_col = "station_type:N",
ptitle=alt.TitleParams(
'Most Top-Performers Show Consistent Intra-Week Performance',
anchor='start',
dx=0,
fontSize=axis_label_fontsize
),
label_non_white_color = 'Includes Weekdays & Weekends',
radius = 180,
text_radius_offset = 230,
axis_label_fontsize = 14,
annotation_label_fontsize = 18,
annotation_radius = 90,
yvar_non_white_threshold = 25,
stroke_thickness = 5,
x_loc_annotation = 0,
x_loc_label = 40,
colors = {
'Includes Weekdays & Weekends': 'darkgreen',
'Excludes Weekdays or Weekends': '#c7e9c0',
},
)
chart.save(
os.path.join(
figures_dir, '06_top_performers_day_of_week.png'
)
)
chart
CPU times: user 59.8 ms, sys: 46 µs, total: 59.8 ms Wall time: 57.5 ms
Observations
- Only about 20% of the top-performing stations are top-performers on just one of weekdays or weekends. The majority of overall top-performing stations are also top-performers on both weekdays and weekends. Focusing the ad campaign on this latter group of stations will maximize ridership throughout the week.
Summary of Metrics Used
- fraction of overall top-performing bike share stations that are also top-performers on weekdays and weekends
User Behaviour Insights¶
For all user behaviour (bike share ridership patterns) insights, the number of trips per period (month, hour, day of week) will be calculated for both station types. The two types of stations are top-performers and all other stations. By definition, there are fewer top-performing stations than other stations. So, the raw totals for the number of trips at top-performing stations will always be smaller than the total trips across the other stations during the chosen period. This might incorrectly suggest that top-performers are under-performing relative to all other stations. Additionally, ridership patterns for the two types of users (Annual and Casual) might differ per period.
In order to take these factors into account, instead of using the raw totals, the metric to be compared when extracting insights from bike share ridership will be the average ridership per station during a given period. To do this, for each type of station, the total number of trips during a given period will be divided by the number of stations used during that same period. This metric will be calculated separately for each type of user. Also, the metric will be calculated per year or per year and month. It will be used to explore periodic user behaviour through bike share ridership patterns at each type of station and by both types of users.
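As a minimal sketch of this metric on a hypothetical toy trips table:

```python
import pandas as pd

# Hypothetical trips table: one row per trip, with the station it departed from,
# the station's type, and the month of the trip
trips = pd.DataFrame(
    {
        "station_id": [1, 1, 2, 3, 3, 3, 4],
        "station_type": ["top-performer"] * 3 + ["others"] * 4,
        "month": [7] * 7,
    }
)

# Average ridership per station for a period = total trips in the period
# divided by the number of distinct stations used during that period
per_type = trips.groupby(["station_type", "month"]).agg(
    trips=("station_id", "size"),
    num_stations_used=("station_id", "nunique"),
)
per_type["avg_trips_per_station"] = per_type["trips"] / per_type["num_stations_used"]
print(per_type)
```

Here the "others" group has more raw trips (4 vs 3) but the per-station average corrects for group size, which is exactly why the metric is preferred over raw totals.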
Show Monthly Ridership By Station Type¶
Get the average monthly trips per station during 2018, per user type and station type, using the following approach
- get monthly departures per user type and station
- get monthly arrivals per user type and station
- combine departures and arrivals (using UNION) and then reshape into untidy data so that, for each station, the total departures and arrivals appear as columns
- combine (LEFT JOIN) with the table indicating if a station is a top-performer
- fill any missing values in the is_top_perform_station column after combining in step 4
- get monthly departures and arrivals per station type and user type, and calculate the average number of monthly departures and arrivals per station by taking the ratio of the total monthly departures and arrivals columns to the total stations used per month
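Steps 3-5 can be sketched in pandas on toy data (hypothetical station IDs and counts), showing how stations with only departures or only arrivals are kept, and how stations missing from the flag table default to False:

```python
import pandas as pd

# Hypothetical monthly counts: station 2 has departures only and station 3
# has arrivals only, which the UNION in the query is meant to capture
departures = pd.DataFrame({"station_id": [1, 2], "trips": [10, 4], "type": "departures"})
arrivals = pd.DataFrame({"station_id": [1, 3], "trips": [9, 5], "type": "arrivals"})

# Stack both, then pivot so departures and arrivals become columns per station;
# a station missing one side gets NaN there, filled with 0 (the COALESCE step)
wide = (
    pd.concat([departures, arrivals])
    .pivot_table(index="station_id", columns="type", values="trips", aggfunc="max")
    .fillna(0)
)

# LEFT JOIN with the top-performer flags (hypothetical stand-in for df_stations);
# stations absent from the flag table default to False
flags = pd.DataFrame(
    {"station_id": [1], "is_top_perform_station": [True]}
).set_index("station_id")
wide = wide.join(flags, how="left")
wide["is_top_perform_station"] = wide["is_top_perform_station"].fillna(False)
print(wide)
```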
%%time
query = f"""
WITH t1 AS (
-- 1. get number of monthly departures per station, user type and month in 2018
SELECT start_station_id AS station_id,
started_at_year AS year,
started_at_month AS month,
REPLACE(user_type, ' Member', '') AS user_type,
COUNT(DISTINCT(trip_id)) AS trips,
'departures' AS type
FROM read_parquet({fpaths_proc[2018]})
GROUP BY all
),
t2 AS (
-- 2. get number of monthly arrivals per station, user type and month in 2018
SELECT end_station_id AS station_id,
ended_at_year AS year,
ended_at_month AS month,
REPLACE(user_type, ' Member', '') AS user_type,
COUNT(DISTINCT(trip_id)) AS trips,
'arrivals' AS type
FROM read_parquet({fpaths_proc[2018]})
WHERE ended_at_year = 2018
GROUP BY all
),
-- 3. combine monthly departures and arrivals per station
-- first use UNION to capture stations with only departures or only arrivals
-- then reshape into untidy data to get departures & arrivals per station as columns
t3 AS (
PIVOT (
SELECT *
FROM t1
UNION
SELECT *
FROM t2
)
ON type USING MAX(trips)
),
-- 4. LEFT JOIN with table indicating if station is a top-performer
t4 AS (
SELECT *
FROM t3
LEFT JOIN (
SELECT station_id,
is_top_perform_station
FROM df_stations
) USING (station_id)
),
-- 5. fill any missing values in the is_top_perform_station column after above
-- LEFT JOIN with FALSE
t5 AS (
SELECT --fill missing values in is_top_perform_station with FALSE
* EXCLUDE(is_top_perform_station, departures, arrivals),
COALESCE(is_top_perform_station, NULL, False) AS is_top_perform_station,
COALESCE(departures, NULL, 0) AS departures,
COALESCE(arrivals, NULL, 0) AS arrivals
FROM t4
),
-- 6. calculate average monthly trips per station, station type and user type
t6 AS (
SELECT -- change the top-performing station column from a boolean to a string
(
CASE WHEN is_top_perform_station = True
THEN 'top-performer'
ELSE 'others' END
) AS station_type,
user_type,
year,
month,
-- number of stations used
COUNT(DISTINCT(station_id)) AS num_stations_used,
-- strftime(make_date(year, month, 1), '%Y-%m') AS date_ym,
make_date(year, month, 1) AS date_ym,
-- get departures
SUM(departures) AS departures,
-- get arrivals
SUM(arrivals) AS arrivals,
-- get average departures
SUM(departures)/num_stations_used AS avg_departures_per_station,
-- get average arrivals
SUM(arrivals)/num_stations_used AS avg_arrivals_per_station,
strftime(make_date(2018, 7, 1), '%Y-%m') AS peak_ridership
FROM t5
GROUP BY ALL
ORDER BY ALL
)
SELECT *
FROM t6
"""
df_by_month_2018 = run_sql_query(query).convert_dtypes()
pu.show_df(df_by_month_2018)
| column | station_type | user_type | year | month | num_stations_used | date_ym | departures | arrivals | avg_departures_per_station | avg_arrivals_per_station | peak_ridership |
|---|---|---|---|---|---|---|---|---|---|---|---|
| dtype | string[python] | string[python] | Int32 | Int32 | Int64 | datetime64[us] | Int64 | Int64 | Float64 | Float64 | string[python] |
| nunique | 2 | 2 | 1 | 12 | 20 | 12 | 48 | 48 | 48 | 48 | 1 |
| missing | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | others | Annual | 2018 | 1 | 192 | 2018-01-01 | 22522 | 22221 | 117.302083 | 115.734375 | 2018-07 |
| 1 | others | Annual | 2018 | 2 | 193 | 2018-02-01 | 25536 | 25219 | 132.310881 | 130.668394 | 2018-07 |
| 2 | others | Annual | 2018 | 3 | 192 | 2018-03-01 | 42383 | 41675 | 220.744792 | 217.057292 | 2018-07 |
| 3 | others | Annual | 2018 | 4 | 192 | 2018-04-01 | 44578 | 44226 | 232.177083 | 230.34375 | 2018-07 |
| 4 | others | Annual | 2018 | 5 | 193 | 2018-05-01 | 88022 | 86699 | 456.072539 | 449.217617 | 2018-07 |
| 5 | others | Annual | 2018 | 6 | 210 | 2018-06-01 | 101149 | 100255 | 481.661905 | 477.404762 | 2018-07 |
| 6 | others | Annual | 2018 | 7 | 245 | 2018-07-01 | 114157 | 113672 | 465.946939 | 463.967347 | 2018-07 |
| 7 | others | Annual | 2018 | 8 | 265 | 2018-08-01 | 112639 | 112759 | 425.05283 | 425.50566 | 2018-07 |
| 8 | others | Annual | 2018 | 9 | 270 | 2018-09-01 | 114040 | 114485 | 422.37037 | 424.018519 | 2018-07 |
| 9 | others | Annual | 2018 | 10 | 270 | 2018-10-01 | 89755 | 89136 | 332.425926 | 330.133333 | 2018-07 |
| 10 | others | Annual | 2018 | 11 | 270 | 2018-11-01 | 56093 | 55611 | 207.751852 | 205.966667 | 2018-07 |
| 11 | others | Annual | 2018 | 12 | 270 | 2018-12-01 | 45113 | 44430 | 167.085185 | 164.555556 | 2018-07 |
| 12 | others | Casual | 2018 | 1 | 158 | 2018-01-01 | 582 | 568 | 3.683544 | 3.594937 | 2018-07 |
| 13 | others | Casual | 2018 | 2 | 172 | 2018-02-01 | 914 | 939 | 5.313953 | 5.459302 | 2018-07 |
| 14 | others | Casual | 2018 | 3 | 188 | 2018-03-01 | 2593 | 2539 | 13.792553 | 13.505319 | 2018-07 |
| 15 | others | Casual | 2018 | 4 | 191 | 2018-04-01 | 4650 | 4561 | 24.34555 | 23.879581 | 2018-07 |
| 16 | others | Casual | 2018 | 5 | 192 | 2018-05-01 | 18371 | 18236 | 95.682292 | 94.979167 | 2018-07 |
| 17 | others | Casual | 2018 | 6 | 210 | 2018-06-01 | 24843 | 24442 | 118.3 | 116.390476 | 2018-07 |
| 18 | others | Casual | 2018 | 7 | 245 | 2018-07-01 | 28759 | 28411 | 117.383673 | 115.963265 | 2018-07 |
| 19 | others | Casual | 2018 | 8 | 264 | 2018-08-01 | 30440 | 29888 | 115.30303 | 113.212121 | 2018-07 |
| 20 | others | Casual | 2018 | 9 | 269 | 2018-09-01 | 20422 | 20400 | 75.918216 | 75.836431 | 2018-07 |
| 21 | others | Casual | 2018 | 10 | 267 | 2018-10-01 | 6624 | 6623 | 24.808989 | 24.805243 | 2018-07 |
| 22 | others | Casual | 2018 | 11 | 251 | 2018-11-01 | 1630 | 1693 | 6.494024 | 6.74502 | 2018-07 |
| 23 | others | Casual | 2018 | 12 | 237 | 2018-12-01 | 1173 | 1172 | 4.949367 | 4.945148 | 2018-07 |
| 24 | top-performer | Annual | 2018 | 1 | 77 | 2018-01-01 | 19651 | 19949 | 255.207792 | 259.077922 | 2018-07 |
| 25 | top-performer | Annual | 2018 | 2 | 77 | 2018-02-01 | 21394 | 21710 | 277.844156 | 281.948052 | 2018-07 |
| 26 | top-performer | Annual | 2018 | 3 | 77 | 2018-03-01 | 35572 | 36280 | 461.974026 | 471.168831 | 2018-07 |
| 27 | top-performer | Annual | 2018 | 4 | 77 | 2018-04-01 | 36763 | 37111 | 477.441558 | 481.961039 | 2018-07 |
| 28 | top-performer | Annual | 2018 | 5 | 77 | 2018-05-01 | 70712 | 72026 | 918.337662 | 935.402597 | 2018-07 |
| 29 | top-performer | Annual | 2018 | 6 | 84 | 2018-06-01 | 82588 | 83481 | 983.190476 | 993.821429 | 2018-07 |
| 30 | top-performer | Annual | 2018 | 7 | 85 | 2018-07-01 | 97056 | 97546 | 1141.835294 | 1147.6 | 2018-07 |
| 31 | top-performer | Annual | 2018 | 8 | 85 | 2018-08-01 | 93024 | 92899 | 1094.4 | 1092.929412 | 2018-07 |
| 32 | top-performer | Annual | 2018 | 9 | 85 | 2018-09-01 | 89627 | 89196 | 1054.435294 | 1049.364706 | 2018-07 |
| 33 | top-performer | Annual | 2018 | 10 | 87 | 2018-10-01 | 68707 | 69320 | 789.735632 | 796.781609 | 2018-07 |
| 34 | top-performer | Annual | 2018 | 11 | 88 | 2018-11-01 | 43753 | 44236 | 497.193182 | 502.681818 | 2018-07 |
| 35 | top-performer | Annual | 2018 | 12 | 88 | 2018-12-01 | 34849 | 35539 | 396.011364 | 403.852273 | 2018-07 |
| 36 | top-performer | Casual | 2018 | 1 | 77 | 2018-01-01 | 588 | 602 | 7.636364 | 7.818182 | 2018-07 |
| 37 | top-performer | Casual | 2018 | 2 | 77 | 2018-02-01 | 1035 | 1010 | 13.441558 | 13.116883 | 2018-07 |
| 38 | top-performer | Casual | 2018 | 3 | 77 | 2018-03-01 | 2557 | 2611 | 33.207792 | 33.909091 | 2018-07 |
| 39 | top-performer | Casual | 2018 | 4 | 77 | 2018-04-01 | 4734 | 4820 | 61.480519 | 62.597403 | 2018-07 |
| 40 | top-performer | Casual | 2018 | 5 | 77 | 2018-05-01 | 19448 | 19577 | 252.571429 | 254.246753 | 2018-07 |
| 41 | top-performer | Casual | 2018 | 6 | 84 | 2018-06-01 | 26040 | 26429 | 310.0 | 314.630952 | 2018-07 |
| 42 | top-performer | Casual | 2018 | 7 | 85 | 2018-07-01 | 28515 | 28874 | 335.470588 | 339.694118 | 2018-07 |
| 43 | top-performer | Casual | 2018 | 8 | 85 | 2018-08-01 | 28435 | 28981 | 334.529412 | 340.952941 | 2018-07 |
| 44 | top-performer | Casual | 2018 | 9 | 85 | 2018-09-01 | 19031 | 19069 | 223.894118 | 224.341176 | 2018-07 |
| 45 | top-performer | Casual | 2018 | 10 | 87 | 2018-10-01 | 6773 | 6774 | 77.850575 | 77.862069 | 2018-07 |
| 46 | top-performer | Casual | 2018 | 11 | 87 | 2018-11-01 | 1552 | 1488 | 17.83908 | 17.103448 | 2018-07 |
| 47 | top-performer | Casual | 2018 | 12 | 87 | 2018-12-01 | 1181 | 1183 | 13.574713 | 13.597701 | 2018-07 |
CPU times: user 3.4 s, sys: 61.8 ms, total: 3.46 s Wall time: 536 ms
Notes
- A similar approach is used to get the average trips using other temporal aggregations (month of year, hour of day, weekday or weekend).
Show above as a chart
%%time
for yvar, title_text in zip(
['avg_departures_per_station:Q', 'avg_arrivals_per_station:Q'],
['Departures', 'Arrivals'],
):
chart = vzu.plot_grouped_line_charts(
df_by_month_2018,
pd.DataFrame({'date_ym': [pd.to_datetime('2018-07-01')]}),
xvar="yearmonth(date_ym):T",
yvar=yvar,
color_by_col="user_type:N",
color_labels=['Annual', 'Casual'],
color_values=['darkgreen', '#a1d99b'],
legend=alt.Legend(
titleFontSize=axis_label_fontsize,
labelFontSize=axis_label_fontsize,
labelColor='black',
orient='bottom',
direction='horizontal',
titleOrient='left',
),
annotation_text='Middle of year',
annotation_text_color='red', # 'grey'
annotation_text_opacity=0.5,
annotation_text_x_loc=-60,
annotation_text_y_loc=-135,
xvar_rule="yearmonth(peak_ridership):T",
color_rule='', # '#cccccc'
title_text=f'Peak 2018 {title_text} Occurs During Warmest Months',
sub_title_text_color='#7f7f7f',
sub_title_vertical_padding=5,
axis_label_fontsize=14,
x_axis_ticks=None,
concat='column',
fig_size=dict(width=450, height=300),
)
display(chart)
CPU times: user 140 ms, sys: 4.14 ms, total: 144 ms Wall time: 144 ms
Observations
- Peak ridership is observed during the warmer months of the year, for both types of stations and for both types of users (Casual and Annual).
- The monthly ridership patterns are similar for departures and arrivals since the network has higher usage during the warmer months. Trips departing from one station in the network must arrive at another station, so if departures follow a monthly pattern then it is not surprising that arrivals follow a similar pattern. Due to this strong similarity between the trends for departures and arrivals, only departures will be considered in further exploration, and these will be referred to as trips.
- Annual ridership was dominant during 2018, at more than double the Casual ridership. This is a pattern seen in many bike share systems due to the proximity of a large number of stations to office buildings in downtown locations within a city.
Get the average monthly trips per station for each year and month in all available data per user type and station type, separately for weekdays and weekends
%%time
dfs_by_month = {}
for day_type, dow in zip(['weekday', 'weekend'], [[0, 1, 2, 3, 4], [5, 6]]):
dow_str = '('+', '.join([str(d) for d in dow])+')'
query = f"""
WITH t1 AS (
-- 1. get number of monthly trips per station, user type, year & month in all data
SELECT start_station_id AS station_id,
started_at_year AS year,
started_at_month AS month,
REPLACE(user_type, ' Member', '') AS user_type,
COUNT(DISTINCT(trip_id)) AS trips
FROM read_parquet({fpaths_proc_all})
WHERE ISODOW(started_at)-1 IN {dow_str}
GROUP BY all
),
-- 2. LEFT JOIN with table indicating if station is a top-performer
t2 AS (
SELECT *
FROM t1
LEFT JOIN (
SELECT station_id,
is_top_perform_station
FROM df_stations
) USING (station_id)
),
-- 3. fill any missing values in the is_top_perform_station column after LEFT JOIN
-- with FALSE
t3 AS (
SELECT --fill missing values in is_top_perform_station with FALSE
* EXCLUDE(is_top_perform_station),
COALESCE(is_top_perform_station, NULL, False) AS is_top_perform_station
FROM t2
),
-- 4. calculate average monthly trips per station, station type, user type, year and month
t4 AS (
SELECT -- change the top-performing station column from a boolean to a string
(
CASE WHEN is_top_perform_station = True
THEN 'top-performer'
ELSE 'others' END
) AS station_type,
user_type,
year,
month,
-- get number of stations
COUNT(DISTINCT(station_id)) AS num_stations_used,
strftime(make_date(year, month, 1), '%Y-%m') AS date_ym,
-- get trips (ridership)
SUM(trips) AS trips,
-- get average ridership
SUM(trips)/num_stations_used AS avg_trips_per_station,
strftime(make_date(2021, 10, 1), '%Y-%m') AS corporate_start
FROM t3
GROUP BY ALL
ORDER BY ALL
)
SELECT *
FROM t4
"""
dfs_by_month[day_type] = run_sql_query(query).convert_dtypes()
pu.show_df(dfs_by_month['weekday'])
| column | station_type | user_type | year | month | num_stations_used | date_ym | trips | avg_trips_per_station | corporate_start |
|---|---|---|---|---|---|---|---|---|---|
| dtype | string[python] | string[python] | Int32 | Int32 | Int64 | string[python] | Int64 | Float64 | string[python] |
| nunique | 2 | 2 | 6 | 12 | 98 | 63 | 252 | 252 | 1 |
| missing | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | others | Annual | 2018 | 1 | 191 | 2018-01 | 18577 | 97.26178 | 2021-10 |
| 1 | others | Annual | 2018 | 2 | 192 | 2018-02 | 21486 | 111.90625 | 2021-10 |
| 2 | others | Annual | 2018 | 3 | 192 | 2018-03 | 34113 | 177.671875 | 2021-10 |
| 3 | others | Annual | 2018 | 4 | 192 | 2018-04 | 36652 | 190.895833 | 2021-10 |
| 4 | others | Annual | 2018 | 5 | 192 | 2018-05 | 73430 | 382.447917 | 2021-10 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 247 | top-performer | Casual | 2022 | 11 | 96 | 2022-11 | 59875 | 623.697917 | 2021-10 |
| 248 | top-performer | Casual | 2022 | 12 | 96 | 2022-12 | 37288 | 388.416667 | 2021-10 |
| 249 | top-performer | Casual | 2023 | 1 | 96 | 2023-01 | 38847 | 404.65625 | 2021-10 |
| 250 | top-performer | Casual | 2023 | 2 | 96 | 2023-02 | 38214 | 398.0625 | 2021-10 |
| 251 | top-performer | Casual | 2023 | 3 | 96 | 2023-03 | 55798 | 581.229167 | 2021-10 |
252 rows × 9 columns
CPU times: user 16.3 s, sys: 691 ms, total: 17 s Wall time: 1.75 s
Notes
It may be possible to replace t2 and t3 by a single CTE that both performs the LEFT JOIN and then fills missing values as shown below, while also respecting the SQL order of operations

-- LEFT JOIN with table indicating if station is a top-performer and fill missing values
t3 AS (
    SELECT * EXCLUDE(is_top_perform_station),
           COALESCE(is_top_perform_station, NULL, False) AS is_top_perform_station
    FROM t1
    LEFT JOIN (
        SELECT station_id,
               is_top_perform_station
        FROM df_stations
    ) USING (station_id)
),

This is more condensed than the version used above. In future work, it should be verified that this approach can be used.
Show above as a chart
%%time
for day_type, fname_no_ext in zip(
['weekday', 'weekend'],
['07_monthly_weekday_ridership', '08_monthly_weekend_ridership'],
):
chart = vzu.plot_grouped_line_charts(
dfs_by_month[day_type],
pd.DataFrame({'date_ym': [pd.to_datetime('2022-10-01')]}),
xvar="yearmonth(date_ym):T",
yvar="avg_trips_per_station:Q",
color_by_col="user_type:N",
color_labels=['Annual', 'Casual'],
color_values=['darkgreen', '#a1d99b'],
legend=alt.Legend(
titleFontSize=axis_label_fontsize,
labelFontSize=axis_label_fontsize,
labelColor='black',
orient='bottom',
direction='horizontal',
titleOrient='left',
),
annotation_text='Corporate Plan Starts' if day_type == 'weekday' else '',
annotation_text_color='red', # 'grey'
annotation_text_opacity=0.5,
annotation_text_x_loc=-175,
annotation_text_y_loc=-95,
xvar_rule="yearmonth(corporate_start):T",
color_rule='red', # '#cccccc'
title_text=(
'Inversion in Casual & Annual Patterns since Oct. 2021'
if day_type == 'weekday'
else 'Weekend Casual Ridership is Dominant since Oct. 2021'
),
sub_title_text_color='#7f7f7f',
sub_title_vertical_padding=5,
axis_label_fontsize=14,
x_axis_ticks=None,
concat='column',
fig_size=dict(width=450, height=300),
)
chart.save(os.path.join(figures_dir, f'{fname_no_ext}.png'))
display(chart)
CPU times: user 577 ms, sys: 7.88 ms, total: 585 ms Wall time: 496 ms
Observations
- A corporate bikeshare membership plan was offered in late September of 2021. Since then, average monthly Casual ridership has accounted for the majority of overall bike share ridership for several reasons
- switch to hybrid working hours
- use of bike share as an outdoor alternative to public transit to practice social-distancing
- increased environmental awareness
- improved cost flexibility (the corporate plan might partly account for this)
- increased bicycle infrastructure (e.g. bike lanes) introduced by the City of Toronto
- At both types of stations, Casual ridership shows a year-over-year increase since the corporate plan was offered. By comparison, Annual monthly ridership is nearly unchanged at top-performing stations, on both weekdays and weekends. However, at other (non-top-performing) stations, Annual ridership has trended downward on weekdays between 2018 and 2022.
- Ridership by both member types peaks during the warmer (middle) months of the year at both types of stations, on weekdays and weekends. This is the prime bike share season in Toronto. A wider peak is observed for Casual than for Annual ridership at both types of stations, suggesting that seasonal growth in Casual ridership starts earlier and ends later in the year.
- On weekends, average Casual and Annual monthly ridership was nearly equal and relatively unchanged from 2019 to 2021. Annual ridership outperformed Casual ridership on weekdays from 2018 to 2021. These were the distinguishing factors between Annual and Casual ridership during the four years from 2018 to 2021 inclusive. However, on both weekdays and weekends in 2022, Casual monthly ridership has outperformed Annual ridership in every month since April. This is true at both types of stations.
Summary of Metrics Used
- average number of monthly departures (trips) per station, user type and station type for each year and month
User Behaviour Insights - Show Hourly Ridership By Station Type¶
Get the average hourly trips per station during 2018, per user type and station type using the following approach
- get hourly departures (trips) per station and user type during 2018
- combine (`LEFT JOIN`) with the table indicating if a station is a top-performer
- fill any missing values in the `is_top_perform_station` column after combining in step 2
- get hourly departures (trips) per station type and user type and calculate the average number of hourly trips per station by taking the ratio of the total trips to the total stations used per hour during 2018
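The per-hour average in the last step (total trips divided by distinct stations used in that hour) can be sketched in pandas on a hypothetical mini-dataset:

```python
import pandas as pd

# hypothetical hourly trip counts per station
trips = pd.DataFrame({
    "station_id": [1, 1, 2, 2, 2],
    "hour":       [8, 8, 8, 17, 17],
    "trips":      [5, 3, 4, 6, 2],
})

hourly = (
    trips.groupby("hour")
    .agg(trips=("trips", "sum"), num_stations_used=("station_id", "nunique"))
    .reset_index()
)
# average hourly trips per station = total trips / stations used that hour
hourly["avg_trips_per_station"] = hourly["trips"] / hourly["num_stations_used"]
print(hourly)
```

This mirrors the `SUM(trips)/num_stations_used` ratio computed in the SQL below.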
%%time
query = f"""
WITH t1 AS (
-- 1. get number of hourly trips per station and user type in 2018
SELECT start_station_id AS station_id,
started_at_year AS year,
started_at_hour AS hour,
REPLACE(user_type, ' Member', '') AS user_type,
COUNT(DISTINCT(trip_id)) AS trips
FROM read_parquet({fpaths_proc[2018]})
GROUP BY all
),
-- 2. LEFT JOIN with table indicating if station is a top-performer
t2 AS (
SELECT *
FROM t1
LEFT JOIN (
SELECT station_id,
is_top_perform_station
FROM df_stations
) USING (station_id)
),
-- 3. fill any missing values in the is_top_perform_station column after LEFT JOIN
-- with FALSE
t3 AS (
SELECT --fill missing values in is_top_perform_station with FALSE
* EXCLUDE(is_top_perform_station),
COALESCE(is_top_perform_station, NULL, False) AS is_top_perform_station
FROM t2
),
-- 4. calculate average hourly trips per station, station type and user type
t4 AS (
SELECT -- change the top-performing station column from a boolean to a string
(
CASE WHEN is_top_perform_station = True
THEN 'top-performer'
ELSE 'others' END
) AS station_type,
user_type,
year,
hour,
-- get number of stations
COUNT(DISTINCT(station_id)) AS num_stations_used,
strftime(make_timestamp(year, 1, 1, hour, 0, 0), '%Y %H') AS date_yh,
-- get trips (ridership)
SUM(trips) AS trips,
-- get average ridership
SUM(trips)/num_stations_used AS avg_trips_per_station,
'2018 08' AS peak_ridership
FROM t3
GROUP BY ALL
ORDER BY ALL
)
SELECT *
FROM t4
"""
df_by_hour_2018 = run_sql_query(query).convert_dtypes()
pu.show_df(df_by_hour_2018)
| column | station_type | user_type | year | hour | num_stations_used | date_yh | trips | avg_trips_per_station | peak_ridership |
|---|---|---|---|---|---|---|---|---|---|
| dtype | string[python] | string[python] | Int32 | Int32 | Int64 | string[python] | Int64 | Float64 | string[python] |
| nunique | 2 | 2 | 1 | 24 | 40 | 24 | 96 | 96 | 1 |
| missing | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | others | Annual | 2018 | 0 | 253 | 2018 00 | 7813 | 30.881423 | 2018 08 |
| 1 | others | Annual | 2018 | 1 | 245 | 2018 01 | 4788 | 19.542857 | 2018 08 |
| 2 | others | Annual | 2018 | 2 | 238 | 2018 02 | 3294 | 13.840336 | 2018 08 |
| 3 | others | Annual | 2018 | 3 | 217 | 2018 03 | 1486 | 6.847926 | 2018 08 |
| 4 | others | Annual | 2018 | 4 | 198 | 2018 04 | 1449 | 7.318182 | 2018 08 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 91 | top-performer | Casual | 2018 | 19 | 88 | 2018 19 | 10660 | 121.136364 | 2018 08 |
| 92 | top-performer | Casual | 2018 | 20 | 87 | 2018 20 | 8190 | 94.137931 | 2018 08 |
| 93 | top-performer | Casual | 2018 | 21 | 86 | 2018 21 | 6269 | 72.895349 | 2018 08 |
| 94 | top-performer | Casual | 2018 | 22 | 86 | 2018 22 | 4612 | 53.627907 | 2018 08 |
| 95 | top-performer | Casual | 2018 | 23 | 85 | 2018 23 | 3425 | 40.294118 | 2018 08 |
96 rows × 9 columns
CPU times: user 1.24 s, sys: 23 ms, total: 1.27 s Wall time: 276 ms
Show above as a chart
%%time
chart = vzu.plot_grouped_line_charts(
df_by_hour_2018,
pd.DataFrame({'date_yh': ['2018 08']}),
xvar="date_yh:N",
yvar="avg_trips_per_station:Q",
color_by_col="user_type:N",
color_labels=['Annual', 'Casual'],
color_values=['darkgreen', '#a1d99b'],
legend=alt.Legend(
titleFontSize=axis_label_fontsize,
labelFontSize=axis_label_fontsize,
labelColor='black',
orient='bottom',
direction='horizontal',
titleOrient='left',
),
annotation_text='Annual Peaks at 8AM, 5PM',
annotation_text_color='red', # 'grey'
annotation_text_opacity=0.5,
annotation_text_x_loc=60,
annotation_text_y_loc=-85,
xvar_rule="",
color_rule='red', # '#cccccc'
title_text='Annual 2018 Trips Show Commuting-Driven Usage',
sub_title_text_color='#7f7f7f',
sub_title_vertical_padding=5,
axis_label_fontsize=14,
x_axis_ticks=df_by_hour_2018.iloc[::2, :]['date_yh'].tolist(),
concat='column',
fig_size=dict(width=450, height=300),
)
chart
CPU times: user 32.6 ms, sys: 70 µs, total: 32.7 ms Wall time: 32.2 ms
Observations
- Average hourly ridership per station shows distinct patterns for Annual and Casual members. Annual ridership shows strong evidence of a commuter-driven pattern - the service is used more for commuting to school or work by Annual members than for leisurely activities. This produces peaks in ridership during the morning and evening rush hours. The evening rush hour peak is stronger and wider (covers more hours) than the morning rush hour peak. There is also a much weaker peak during the middle of the day, around the lunch hour.
- Casual ridership shows weak or no evidence of a morning peak but increases gradually to a peak in the evening. Peak ridership for Casual members is weaker than the Annual peak by a factor of nearly 10. This suggests Casual ridership follows a leisure-driven pattern, unlike the commuter-driven pattern of Annual ridership.
- The same commuter-driven and leisurely patterns are observed at both types of stations (top-performers and others).
- The trends seen here are typical of bike share ridership trends seen at other bike share networks as well.
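The peak hours described above can be located programmatically, e.g. with `idxmax` on a (hypothetical) hourly profile:

```python
import pandas as pd

# hypothetical hourly averages with a commuter-style evening peak
profile = pd.DataFrame({
    "hour": [7, 8, 9, 12, 16, 17, 18],
    "avg_trips_per_station": [20.0, 55.0, 30.0, 25.0, 45.0, 60.0, 40.0],
})
peak_hour = profile.loc[profile["avg_trips_per_station"].idxmax(), "hour"]
print(peak_hour)  # 17
```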
Get the average hourly trips per station for each year in all available data per user type and station type, separately for weekdays and weekends
%%time
dfs_by_hour = {}
for day_type, dow in zip(['weekday', 'weekend'], [[0, 1, 2, 3, 4], [5, 6]]):
dow_str = '('+', '.join([str(d) for d in dow])+')'
query = f"""
WITH t1 AS (
-- 1. get number of hourly trips per station, user type and year in all data
SELECT start_station_id AS station_id,
started_at_year AS year,
started_at_hour AS hour,
REPLACE(user_type, ' Member', '') AS user_type,
COUNT(DISTINCT(trip_id)) AS trips
FROM read_parquet({fpaths_proc_all})
WHERE ISODOW(started_at)-1 IN {dow_str}
GROUP BY all
),
-- 2. LEFT JOIN with table indicating if station is a top-performer
t2 AS (
SELECT *
FROM t1
LEFT JOIN (
SELECT station_id,
is_top_perform_station
FROM df_stations
) USING (station_id)
),
-- 3. fill any missing values in the is_top_perform_station column after LEFT JOIN
-- with FALSE
t3 AS (
SELECT --fill missing values in is_top_perform_station with FALSE
* EXCLUDE(is_top_perform_station),
COALESCE(is_top_perform_station, NULL, False) AS is_top_perform_station
FROM t2
),
-- 4. calculate average monthly trips per station, station type, user type and year
t4 AS (
SELECT -- change the top-performing station column from a boolean to a string
(
CASE WHEN is_top_perform_station = True
THEN 'top-performer'
ELSE 'others' END
) AS station_type,
user_type,
year,
hour,
-- get number of stations
COUNT(DISTINCT(station_id)) AS num_stations_used,
strftime(make_timestamp(year, 1, 1, hour, 0, 0), '%Y %H') AS date_yh,
-- get trips (ridership)
SUM(trips) AS trips,
-- get average ridership
SUM(trips)/num_stations_used AS avg_trips_per_station,
'2021 00' AS corporate_start
FROM t3
GROUP BY ALL
ORDER BY ALL
)
SELECT *
FROM t4
"""
dfs_by_hour[day_type] = run_sql_query(query).convert_dtypes()
pu.show_df(dfs_by_hour['weekday'])
| column | station_type | user_type | year | hour | num_stations_used | date_yh | trips | avg_trips_per_station | corporate_start |
|---|---|---|---|---|---|---|---|---|---|
| dtype | string[python] | string[python] | Int32 | Int32 | Int64 | string[python] | Int64 | Float64 | string[python] |
| nunique | 2 | 2 | 6 | 24 | 214 | 144 | 569 | 576 | 1 |
| missing | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | others | Annual | 2018 | 0 | 242 | 2018 00 | 4451 | 18.392562 | 2021 00 |
| 1 | others | Annual | 2018 | 1 | 214 | 2018 01 | 2205 | 10.303738 | 2021 00 |
| 2 | others | Annual | 2018 | 2 | 199 | 2018 02 | 1398 | 7.025126 | 2021 00 |
| 3 | others | Annual | 2018 | 3 | 165 | 2018 03 | 632 | 3.830303 | 2021 00 |
| 4 | others | Annual | 2018 | 4 | 167 | 2018 04 | 1037 | 6.209581 | 2021 00 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 571 | top-performer | Casual | 2023 | 19 | 96 | 2023 19 | 9601 | 100.010417 | 2021 00 |
| 572 | top-performer | Casual | 2023 | 20 | 96 | 2023 20 | 7357 | 76.635417 | 2021 00 |
| 573 | top-performer | Casual | 2023 | 21 | 96 | 2023 21 | 5729 | 59.677083 | 2021 00 |
| 574 | top-performer | Casual | 2023 | 22 | 96 | 2023 22 | 4101 | 42.71875 | 2021 00 |
| 575 | top-performer | Casual | 2023 | 23 | 94 | 2023 23 | 2706 | 28.787234 | 2021 00 |
576 rows × 9 columns
CPU times: user 19.2 s, sys: 607 ms, total: 19.8 s Wall time: 2.09 s
Show above as a chart
%%time
for day_type, fname_no_ext in zip(
['weekday', 'weekend'],
['09_hourly_weekday_ridership', '10_hourly_weekend_ridership'],
):
chart = vzu.plot_grouped_line_charts(
dfs_by_hour[day_type],
pd.DataFrame({'date_yh': ['2021 00']}),
xvar="date_yh:N",
yvar="avg_trips_per_station:Q",
color_by_col="user_type:N",
color_labels=['Annual', 'Casual'],
color_values=['darkgreen', '#a1d99b'],
legend=alt.Legend(
titleFontSize=axis_label_fontsize,
labelFontSize=axis_label_fontsize,
labelColor='black',
orient='bottom',
direction='horizontal',
titleOrient='left',
),
annotation_text='Corporate Plan Starts' if day_type == 'weekday' else '',
annotation_text_color='red', # 'grey'
annotation_text_opacity=0.5,
annotation_text_x_loc=-80,
annotation_text_y_loc=-115,
xvar_rule="corporate_start:N",
color_rule='red', # '#cccccc'
title_text=(
'Since Oct. 2021, Casual Hourly Usage Dominates Annual'
if day_type == 'weekday'
else 'Weekend Casual Hourly Usage is 2X Annual since Oct. 2021'
),
sub_title_text_color='#7f7f7f',
sub_title_vertical_padding=5,
axis_label_fontsize=14,
x_axis_ticks=dfs_by_hour[day_type].iloc[::16, :]['date_yh'].tolist(),
concat='column',
fig_size=dict(width=450, height=300),
)
chart.save(os.path.join(figures_dir, f'{fname_no_ext}.png'))
display(chart)
CPU times: user 765 ms, sys: 2.53 ms, total: 768 ms Wall time: 578 ms
Observations
- Weekdays
- With the pandemic-driven shift away from office commuting, the strong morning peak (centered at 8AM) in average hourly ridership per station by Annual members was much weaker as of 2020. The evening peak (centered at 4PM) remained. However, Casual ridership in 2022 now shows a peak in morning and lunch-hour ridership, plus a stronger and sharper evening peak than that seen in Annual ridership. This morning peak in Casual ridership was always present but weaker than it currently is; it first showed signs of growth in 2021. At both types of stations, Casual ridership now matches or exceeds average hourly ridership per station by Annual members. Through the emergence of this morning peak, Casual ridership has evolved into a mix (hybrid) of the previously observed Annual and Casual usage patterns. This is evidence of the emergence of hybrid workers who do not follow the commuter-driven usage pattern previously seen in Annual member ridership alone.
- Weekends
- A single, wider and more gradual peak in both Annual and Casual ridership is visible, centered at 4PM. There is no morning peak at either type of station.
- For both types of users, average hourly weekend ridership was relatively unchanged from 2019 to 2021, particularly at the top-performing stations. However, in 2022, peak Casual hourly ridership was more than double the Annual peak. These findings reflect the impact of increased usage by Casual members on hourly bike share ridership patterns in Toronto. Leisure trips drive the hourly pattern in weekend ridership, and Casual ridership is now dominant on weekends at both types of stations.
Get the average hourly trips per station for each year and month in all available data for each user type and station type
%%time
dfs_by_month_hour = {}
for day_type, dow in zip(['weekday', 'weekend'], [[0, 1, 2, 3, 4], [5, 6]]):
dow_str = '('+', '.join([str(d) for d in dow])+')'
query = f"""
WITH t1 AS (
-- 1. get number of hourly trips per station, user type, year & month in all data
SELECT start_station_id AS station_id,
started_at_year AS year,
started_at_month AS month,
started_at_hour AS hour,
REPLACE(user_type, ' Member', '') AS user_type,
COUNT(DISTINCT(trip_id)) AS trips
FROM read_parquet({fpaths_proc[2020]+fpaths_proc[2021]+fpaths_proc[2022]})
WHERE ISODOW(started_at)-1 IN {dow_str}
GROUP BY all
),
-- 2. LEFT JOIN with table indicating if station is a top-performer
t2 AS (
SELECT *
FROM t1
LEFT JOIN (
SELECT station_id,
is_top_perform_station
FROM df_stations
) USING (station_id)
),
-- 3. fill any missing values in the is_top_perform_station column after LEFT JOIN
-- with FALSE
t3 AS (
SELECT --fill missing values in is_top_perform_station with FALSE
* EXCLUDE(is_top_perform_station),
COALESCE(is_top_perform_station, NULL, False) AS is_top_perform_station
FROM t2
),
-- 4. calculate average hourly trips per station, station type, user type, year and month
t4 AS (
SELECT -- change the top-performing station column from a boolean to a string
(
CASE WHEN is_top_perform_station = True
THEN 'top-performer'
ELSE 'others' END
) AS station_type,
user_type,
year,
month,
hour,
-- get number of stations
COUNT(DISTINCT(station_id)) AS num_stations_used,
strftime(make_timestamp(year, month, 1, hour, 0, 0), '%Y-%m %H') AS date_yh,
-- get trips (ridership)
SUM(trips) AS trips,
-- get average ridership
SUM(trips)/num_stations_used AS avg_trips_per_station,
'2021-10 00' AS corporate_start
FROM t3
GROUP BY ALL
ORDER BY ALL
)
SELECT *
FROM t4
"""
dfs_by_month_hour[day_type] = run_sql_query(query).convert_dtypes()
pu.show_df(dfs_by_month_hour['weekday'])
| column | station_type | user_type | year | month | hour | num_stations_used | date_yh | trips | avg_trips_per_station | corporate_start |
|---|---|---|---|---|---|---|---|---|---|---|
| dtype | string[python] | string[python] | Int32 | Int32 | Int32 | Int64 | string[python] | Int64 | Float64 | string[python] |
| nunique | 2 | 2 | 3 | 12 | 24 | 498 | 864 | 2266 | 3177 | 1 |
| missing | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | others | Annual | 2020 | 1 | 0 | 121 | 2020-01 00 | 229 | 1.892562 | 2021-10 00 |
| 1 | others | Annual | 2020 | 1 | 1 | 91 | 2020-01 01 | 166 | 1.824176 | 2021-10 00 |
| 2 | others | Annual | 2020 | 1 | 2 | 66 | 2020-01 02 | 103 | 1.560606 | 2021-10 00 |
| 3 | others | Annual | 2020 | 1 | 3 | 36 | 2020-01 03 | 57 | 1.583333 | 2021-10 00 |
| 4 | others | Annual | 2020 | 1 | 4 | 44 | 2020-01 04 | 102 | 2.318182 | 2021-10 00 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3447 | top-performer | Casual | 2022 | 12 | 19 | 94 | 2022-12 19 | 2418 | 25.723404 | 2021-10 00 |
| 3448 | top-performer | Casual | 2022 | 12 | 20 | 94 | 2022-12 20 | 1820 | 19.361702 | 2021-10 00 |
| 3449 | top-performer | Casual | 2022 | 12 | 21 | 94 | 2022-12 21 | 1437 | 15.287234 | 2021-10 00 |
| 3450 | top-performer | Casual | 2022 | 12 | 22 | 94 | 2022-12 22 | 1124 | 11.957447 | 2021-10 00 |
| 3451 | top-performer | Casual | 2022 | 12 | 23 | 86 | 2022-12 23 | 818 | 9.511628 | 2021-10 00 |
3452 rows × 10 columns
CPU times: user 13.9 s, sys: 420 ms, total: 14.3 s Wall time: 1.55 s
Show above as a chart
%%time
for day_type in ['weekday', 'weekend']:
chart = vzu.plot_grouped_line_charts(
dfs_by_month_hour[day_type],
pd.DataFrame({'date_yh': ['2021-10 00']}),
xvar="date_yh:N",
yvar="avg_trips_per_station:Q",
color_by_col="user_type:N",
color_labels=['Annual', 'Casual'],
color_values=['darkgreen', '#a1d99b'],
legend=alt.Legend(
titleFontSize=axis_label_fontsize,
labelFontSize=axis_label_fontsize,
labelColor='black',
orient='bottom',
direction='horizontal',
titleOrient='left',
),
annotation_text='Corporate Plan Starts' if day_type == 'weekday' else '',
annotation_text_color='red', # 'grey'
annotation_text_opacity=0.5,
annotation_text_x_loc=-80,
annotation_text_y_loc=-115,
xvar_rule="corporate_start:N",
color_rule='red', # '#cccccc'
title_text=(
'Since Oct. 2021, Casual Hourly Usage Dominates Annual'
if day_type == 'weekday'
else 'Weekend Casual Hourly Usage is 2X Annual since Oct. 2021'
),
sub_title_text_color='#7f7f7f',
sub_title_vertical_padding=5,
axis_label_fontsize=14,
x_axis_ticks=(
dfs_by_month_hour[day_type]
.query("(station_type == 'others') & (user_type == 'Annual')")
# show tick labels for month-hour combinations every three months
.iloc[::24*3]
['date_yh']
.tolist()
),
concat='row',
fig_size=dict(width=1_400, height=300),
)
display(chart)
CPU times: user 234 ms, sys: 4.03 ms, total: 238 ms Wall time: 237 ms
Observations
- These plots show more clearly the emergence of the stronger morning peak in average hourly Casual ridership immediately after the corporate plan was offered, during October and November of 2021. Annual ridership was dominant again from December 2021 to March or April of 2022, before Casual ridership again dominated average hourly ridership per station across the network.
- The same patterns are observed at top-performing and other stations.
Summary of Metrics Used
- average number of hourly departures (trips) per station, user type and station type for each
- year and month
- year
Show Ridership By Day of the Week and Station Type¶
Get the average trips per station by day of the week for each year in all available data per user type and station type using the following approach
- get weekday departures (trips) per station, user type and year in all available data (2018 to 2023)
- combine (`LEFT JOIN`) with the table indicating if a station is a top-performer
- fill any missing values in the `is_top_perform_station` column after combining in step 2
- get departures (trips) per station type and user type and calculate the average departures (trips) per station for each day of the week by taking the ratio of the total trips to the total stations used per day of the week
%%time
query = f"""
WITH t1 AS (
-- 1. get number of weekday trips per station, user type and year in all data
-- using ID
SELECT start_station_id AS station_id,
started_at_year AS year,
DAYNAME(started_at) AS weekday,
REPLACE(user_type, ' Member', '') AS user_type,
COUNT(DISTINCT(trip_id)) AS trips
FROM read_parquet({fpaths_proc_all})
GROUP BY all
),
-- 2. LEFT JOIN with table indicating if station is a top-performer
t2 AS (
SELECT *
FROM t1
LEFT JOIN (
SELECT station_id,
is_top_perform_station
FROM df_stations
) USING (station_id)
),
-- 3. fill any missing values in the is_top_perform_station column after LEFT JOIN
-- with FALSE
t3 AS (
SELECT --fill missing values in is_top_perform_station with FALSE
* EXCLUDE(is_top_perform_station),
COALESCE(is_top_perform_station, NULL, False) AS is_top_perform_station
FROM t2
),
-- 4. calculate average weekday trips per station, station type, user type and year
t4 AS (
SELECT -- change the top-performing station column from a boolean to a string
(
CASE WHEN is_top_perform_station = True
THEN 'top-performer'
ELSE 'others' END
) AS station_type,
user_type,
year,
weekday,
COUNT(DISTINCT(station_id)) AS num_stations_used,
SUM(trips) AS trips,
SUM(trips)/num_stations_used AS avg_trips_per_station
FROM t3
GROUP BY ALL
ORDER BY ALL
)
SELECT *
FROM t4
"""
df_by_day_of_week = run_sql_query(query).convert_dtypes()
pu.show_df(df_by_day_of_week.head())
| column | station_type | user_type | year | weekday | num_stations_used | trips | avg_trips_per_station |
|---|---|---|---|---|---|---|---|
| dtype | string[python] | string[python] | Int32 | string[python] | Int64 | Int64 | Float64 |
| nunique | 1 | 1 | 1 | 5 | 1 | 5 | 5 |
| missing | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | others | Annual | 2018 | Friday | 270 | 136756 | 506.503704 |
| 1 | others | Annual | 2018 | Monday | 270 | 121718 | 450.807407 |
| 2 | others | Annual | 2018 | Saturday | 270 | 84152 | 311.674074 |
| 3 | others | Annual | 2018 | Sunday | 270 | 77576 | 287.318519 |
| 4 | others | Annual | 2018 | Thursday | 270 | 147048 | 544.622222 |
CPU times: user 13.5 s, sys: 805 ms, total: 14.3 s Wall time: 1.5 s
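Since `DAYNAME` yields plain strings, the `ORDER BY ALL` above sorts days alphabetically (Friday before Monday in the table). Chronological order can be restored downstream with `calendar.day_name`, e.g.:

```python
from calendar import day_name

import pandas as pd

# hypothetical day-of-week aggregate, initially in alphabetical order
df = pd.DataFrame({"weekday": ["Friday", "Monday", "Sunday"], "trips": [3, 1, 2]})
# make weekday an ordered categorical so sorting is chronological (Monday first)
df["weekday"] = pd.Categorical(df["weekday"], categories=list(day_name), ordered=True)
df = df.sort_values("weekday").reset_index(drop=True)
print(df["weekday"].tolist())  # ['Monday', 'Friday', 'Sunday']
```

The charting step below achieves the same effect by passing `column_sort=list(day_name)`.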
Show above as a chart
%%time
weekday_colors = [
'lightgreen', '#39a055', 'black', '#208943', 'darkgreen', '#abdda5', '#c0e6ba'
]
chart = vzu.plot_bar_chart_array(
df_by_day_of_week,
xvar = 'weekday:N',
yvar = "sum(avg_trips_per_station):Q",
color_by_col = "weekday:N",
color_labels = {'Annual': list(day_name), 'Casual': list(day_name)},
color_values = {"Annual": weekday_colors, 'Casual': weekday_colors},
legend=alt.Legend(
titleFontSize=axis_label_fontsize,
labelFontSize=axis_label_fontsize,
labelColor='black',
orient='bottom',
direction='horizontal',
titleOrient='left',
),
title_text={
'Annual': (
'Weekday Ridership by Annual Members outside Top-Performers is '
'Contracting'
),
'Casual': 'Weekday Ridership by Casual Members is Growing Across Entire Network',
},
column_var = "year:N",
column_spacing = 5,
column_label_color = 'grey',
column_sort=list(day_name),
column_label_position = 'bottom',
sub_title_text_color="#7f7f7f",
sub_title_vertical_padding=5,
axis_label_fontsize=axis_label_fontsize,
concat='column',
column_label_align='center',
fig_size=dict(width=105, height=300),
)
chart['Annual'].save(os.path.join(figures_dir, '11_daily_weekday.png'))
chart['Casual'].save(os.path.join(figures_dir, '12_daily_weekend.png'))
display(chart['Annual'])
display(chart['Casual'])
CPU times: user 295 ms, sys: 3.84 ms, total: 299 ms Wall time: 233 ms
Notes
- For each year, the bars are ordered chronologically by day of the week, with Monday on the far left and Sunday on the far right.
- The four darker colors highlight the days of the week (Tuesday, Wednesday, Thursday and Friday) with the highest average ridership per station by Annual members at top-performing stations during 2018 and 2019.
Observations
- Trend
- Average ridership by day of the week per station by Annual members is showing a weak (top-performing) or more pronounced (others) drop since 2020. By comparison, the corresponding ridership was relatively unchanged from 2018 to 2019 at both types of stations.
- Average ridership by day of the week per station by Casual members has grown at both types of stations with the strongest growth qualitatively seen in 2022.
- Together, these trends are similar to the observations seen earlier by month
- Intra-week Seasonality
- For Casual members
- average weekday (Mon - Fri) ridership per station has been less than average weekend ridership since 2018
- average ridership per station increases per day of the week from Monday to Saturday, before dropping on Sunday
- For Annual members
- average weekend (Sat - Sun) ridership per station has been lower than average weekday ridership since 2018
- average ridership per station has reached its maximum on Wednesday or Thursday since 2020
- Other
- the dominance of Casual ridership that emerged during 2022 is also evident in daily average ridership per station: since 2022, average Casual member ridership per station is lowest on Monday, yet even that is larger than the highest average Annual member ridership per station on any day of the week
Summary of Metrics Used
- average number of departures (trips) per station by day of the week, user type and station type for each year.
Show Relationship Between Temperature and Ridership By Station Type¶
Get the average daily trips per station for each year and month in all available data per user type and station type, combined with the maximum daily temperature in the city
- get daily departures (trips) per station, user type, year and month in all available data (2018 to 2023)
- combine (`LEFT JOIN`) with the table indicating if a station is a top-performer
- fill any missing values in the `is_top_perform_station` column after combining in step 2
- get daily departures (trips) per station type and user type and calculate the average number of daily departures (trips) per station for each day by taking the ratio of total daily trips to the total stations used per day
- combine with maximum daily temperature from weather data
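Step 5's join of daily ridership to weather can be sketched in pandas; a left join keeps every trip day even when a weather record is missing (hypothetical data):

```python
import pandas as pd

daily_trips = pd.DataFrame({
    "date": pd.to_datetime(["2022-07-01", "2022-07-02"]),
    "avg_trips_per_station": [12.5, 9.0],
})
weather = pd.DataFrame({"date": pd.to_datetime(["2022-07-01"]), "tmax": [29.4]})

# LEFT JOIN equivalent: keep all trip days; weather columns become NaN when absent
combined = daily_trips.merge(weather, on="date", how="left")
print(combined)
```

This is why `tmin`/`tmax` can carry a few missing values in the combined table below.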
%%time
query = f"""
WITH t1 AS (
-- 1. get number of daily trips per station, user type, year and month in all data
SELECT start_station_id AS station_id,
started_at_year AS year,
started_at_month AS month,
started_at_day AS day,
REPLACE(user_type, ' Member', '') AS user_type,
COUNT(DISTINCT(trip_id)) AS trips
FROM read_parquet({fpaths_proc_all})
GROUP BY all
),
-- 2. LEFT JOIN with table indicating if station is a top-performer
t2 AS (
SELECT *
FROM t1
LEFT JOIN (
SELECT station_id,
is_top_perform_station
FROM df_stations
) USING (station_id)
),
-- 3. fill any missing values in the is_top_perform_station column after LEFT JOIN
-- with FALSE
t3 AS (
SELECT --fill missing values in is_top_perform_station with FALSE
* EXCLUDE(is_top_perform_station),
COALESCE(is_top_perform_station, NULL, False) AS is_top_perform_station
FROM t2
),
-- 4. calculate average daily trips per station, station type, user type, year and month
t4 AS (
SELECT -- change the top-performing station column from a boolean to a string
(
CASE WHEN is_top_perform_station = True
THEN 'top-performer'
ELSE 'others' END
) AS station_type,
user_type,
year,
month,
day,
make_date(year, month, day) AS date,
COUNT(DISTINCT(station_id)) AS num_stations_used,
SUM(trips) AS trips,
SUM(trips)/num_stations_used AS avg_trips_per_station
FROM t3
GROUP BY ALL
ORDER BY ALL
),
-- 5. LEFT JOIN with daily weather data
t5 AS (
SELECT *,
(
CASE
WHEN year = 2018 THEN 19
WHEN year = 2019 THEN 17
WHEN year = 2020 THEN 17
WHEN year = 2021 THEN 15
ELSE 2.5
END
) AS x_turn_point,
0.75 AS y_turn_point,
-- append indicator symbol for presence (2018, 2019) or absence (2020, 2021, 2022)
-- of turning point in ridership based on temperature
'↑' AS mark_turn_point
FROM t4
LEFT JOIN (
SELECT time as date,
tmin,
tmax
FROM read_parquet({[fpath_weather]})
) USING (date)
)
SELECT *,
make_date(2021, 10, 1) AS corporate_start
FROM t5
"""
df_daily = run_sql_query(query).convert_dtypes()
pu.show_df(df_daily)
| column | station_type | user_type | year | month | day | date | num_stations_used | trips | avg_trips_per_station | tmin | tmax | x_turn_point | y_turn_point | mark_turn_point | corporate_start |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| dtype | string[python] | string[python] | Int32 | Int32 | Int32 | datetime64[us] | Int64 | Int64 | Float64 | Float64 | Float64 | Float64 | Float64 | string[python] | datetime64[us] |
| nunique | 2 | 2 | 6 | 12 | 31 | 1916 | 524 | 3895 | 6974 | 404 | 440 | 4 | 1 | 1 | 1 |
| missing | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 4 | 0 | 0 | 0 | 0 |
| 0 | others | Casual | 2018 | 1 | 31 | 2018-01-31 | 5 | 5 | 1.0 | -15.1 | 2.1 | 19.0 | 0.75 | ↑ | 2021-10-01 |
| 1 | others | Casual | 2018 | 2 | 1 | 2018-02-01 | 18 | 21 | 1.166667 | -10.4 | 4.6 | 19.0 | 0.75 | ↑ | 2021-10-01 |
| 2 | others | Casual | 2018 | 2 | 4 | 2018-02-04 | 4 | 4 | 1.0 | -9.0 | 1.5 | 19.0 | 0.75 | ↑ | 2021-10-01 |
| 3 | others | Casual | 2018 | 2 | 6 | 2018-02-06 | 12 | 14 | 1.166667 | -11.3 | -4.8 | 19.0 | 0.75 | ↑ | 2021-10-01 |
| 4 | others | Casual | 2018 | 2 | 7 | 2018-02-07 | 7 | 7 | 1.0 | -13.2 | -4.8 | 19.0 | 0.75 | ↑ | 2021-10-01 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7652 | others | Annual | 2020 | 4 | 16 | 2020-04-16 | 285 | 845 | 2.964912 | -4.8 | 4.9 | 17.0 | 0.75 | ↑ | 2021-10-01 |
| 7653 | others | Annual | 2020 | 5 | 30 | 2020-05-30 | 351 | 2996 | 8.535613 | 8.0 | 18.7 | 17.0 | 0.75 | ↑ | 2021-10-01 |
| 7654 | others | Annual | 2020 | 7 | 2 | 2020-07-02 | 437 | 4335 | 9.919908 | 21.1 | 35.5 | 17.0 | 0.75 | ↑ | 2021-10-01 |
| 7655 | others | Annual | 2018 | 2 | 3 | 2018-02-03 | 136 | 480 | 3.529412 | -10.9 | -2.3 | 19.0 | 0.75 | ↑ | 2021-10-01 |
| 7656 | others | Annual | 2019 | 3 | 23 | 2019-03-23 | 239 | 1107 | 4.631799 | -6.9 | 5.1 | 17.0 | 0.75 | ↑ | 2021-10-01 |
7657 rows × 15 columns
CPU times: user 16.5 s, sys: 533 ms, total: 17 s Wall time: 1.72 s
Notes
- Unicode arrow symbols (↑) are used as turning-point indicators.
Show above as a chart
%%time
chart = vzu.plot_scatter_chart_grid(
    df_daily.query("year.isin(@years_temp_dependence)"),
    xvar='tmax:Q',
    yvar='avg_trips_per_station:Q',
    row_var='year:O',
    color_by_col='user_type:N',
    col_var='station_type:N',
    row_sort=years_temp_dependence,
    facet_label_color='grey',
    color_labels=['Casual', 'Annual'],
    color_values=['darkgreen', '#a1d99b'],
    legend=alt.Legend(
        title=None,
        titleFontSize=axis_label_fontsize,
        labelFontSize=axis_label_fontsize,
        labelColor='black',
        orient='bottom',
        direction='horizontal',
        titleOrient='left',
    ),
    y_scale='linear',
    ptitle=alt.TitleParams(
        (
            'Casual Ridership Shows Increased Resilience to Colder '
            'Temps. & Now Resembles Annual Ridership'
        ),
        anchor='start',
        dx=55,
        dy=-5,
        fontSize=axis_label_fontsize,
    ),
    marker_size=60,
    annotation_text_color='red',
    annotation_text_x_loc_var='x_turn_point:Q',
    annotation_text_y_loc_var='y_turn_point:Q',
    annotation_text_color_by_col='mark_turn_point:N',
    axis_label_fontsize=14,
    fig_size=dict(width=375, height=300),
)
chart.save(
os.path.join(
figures_dir, '13_daily_max_temperature_vs_ridership.png'
)
)
chart
CPU times: user 2.16 s, sys: 51.8 ms, total: 2.21 s Wall time: 1.93 s
Notes
- An inflection point is the temperature around which users decide whether or not to take a trip using bike share.
- The y-axis uses a linear scale (`y_scale='linear'` in the call above).
- Each subplot displays one point per day (up to 365 per year) for Annual members and the same for Casual members.
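The turn (inflection) points above were estimated visually. As a rough programmatic alternative, not the method used in this notebook, a two-segment piecewise-linear fit can scan candidate breakpoints and keep the one that minimizes the total squared error:

```python
import numpy as np

def estimate_turn_point(x, y):
    """Scan candidate breakpoints; return the x value at which splitting the
    data into two independent least-squares lines minimizes total squared error."""
    order = np.argsort(x)
    x, y = np.asarray(x)[order], np.asarray(y)[order]
    best_bp, best_sse = None, np.inf
    for i in range(2, len(x) - 2):  # need at least 2 points per segment
        sse = 0.0
        for xs, ys in ((x[:i], y[:i]), (x[i:], y[i:])):
            coeffs = np.polyfit(xs, ys, 1)          # fit one line per segment
            resid = ys - np.polyval(coeffs, xs)
            sse += float(resid @ resid)
        if sse < best_sse:
            best_bp, best_sse = x[i], sse
    return best_bp

# synthetic daily series: flat ridership below 10C, linear growth above it
rng = np.random.default_rng(0)
tmax = rng.uniform(-10, 35, 365)
trips = np.where(tmax < 10, 2.0, 2.0 + 0.8 * (tmax - 10)) + rng.normal(0, 0.3, 365)
print(estimate_turn_point(tmax, trips))  # estimated breakpoint, near the true 10
```

This is only a sketch; with real, noisier ridership data a robust changepoint method would be preferable.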
Observations
- 2018 and 2019
- Annual ridership is more dispersed, especially at lower temperatures.
- Annual ridership is higher than Casual ridership, as was also seen in the monthly, hourly and day-of-the-week charts.
- Casual ridership shows an inflection point that was visually estimated and is marked with a red arrow for both station types.
- Casual ridership clusters (higher concentrations of points) appear at the bottom left (near the inflection point) and at the top right (centered at ~30C).
- For Annual ridership, a similar cluster appears at the top right only.
- Top-performing stations show faster growth in Casual ridership above the inflection point, visible as a steeper slope in the dark green (Casual) points on either side of the inflection point.
- 2020, 2021, 2022
- The Casual ridership scatter of points is less dispersed.
- The Casual and Annual ridership scatter of points strongly overlap with each other.
- Casual ridership shows an inflection point that is shifting toward lower temperatures. Ridership is becoming more resilient to lower maximum daily temperatures.
- For Annual and Casual ridership, a cluster appears at the top right only.
- The shape of the Casual ridership scatter for top-performers in 2022 appears rotated 180 degrees relative to 2019.
- From 2019 to 2022, the separation between the temperature dependence of Casual and Annual ridership has reduced until it was eventually almost absent. In 2022, the temperature dependences for both types of users show strong similarities to each other.
Summary of Metrics Used
- average number of daily departures (trips) per station, user type and station type for each year and month
Get the average daily trips per station, per station type, overall (across both user types) in all available data, combined with the minimum and maximum daily temperature in the city
- get daily departures (trips) per station, year and month in all available data (2018 to 2022)
- combine with the table indicating whether each station is a top-performer
- fill any missing values in the is_top_perform_station column after combining in step 2
- get daily departures (trips) per station type and calculate the average number of daily departures (trips) per station for each day by taking the ratio of total daily trips to the total stations used that day
- combine with the minimum and maximum daily temperature from the weather data
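The steps above can be sketched on a toy pandas frame (fabricated data; the column names follow the query in the next cell, the weather join in step 5 is omitted):

```python
import pandas as pd

# step 1 output: toy daily trip counts per station
t1 = pd.DataFrame({
    "station_id": [1, 1, 2, 3],
    "date": pd.to_datetime(["2022-07-01", "2022-07-02", "2022-07-01", "2022-07-01"]),
    "trips": [10, 12, 4, 6],
})
# step 2: left-join the top-performer flag (station 3 is unknown -> NaN)
flags = pd.DataFrame({"station_id": [1, 2], "is_top_perform_station": [True, False]})
t2 = t1.merge(flags, on="station_id", how="left")
# step 3: fill missing flags with False (the COALESCE in the query)
t2["is_top_perform_station"] = t2["is_top_perform_station"].fillna(False).astype(bool)
# step 4: average daily trips per station = total daily trips / stations used that day
t4 = (
    t2.assign(
        station_type=lambda d: d["is_top_perform_station"].map(
            {True: "top-performer", False: "others"}
        )
    )
    .groupby(["station_type", "date"], as_index=False)
    .agg(num_stations_used=("station_id", "nunique"), trips=("trips", "sum"))
    .assign(avg_trips_per_station=lambda d: d["trips"] / d["num_stations_used"])
)
print(t4)
```

On 2022-07-01 the two "others" stations contribute 4 + 6 = 10 trips across 2 stations, giving an average of 5.0 trips per station, mirroring the `SUM(trips)/num_stations_used` ratio in the query.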
%%time
query = f"""
WITH t1 AS (
-- 1. get number of daily trips per station, user type, year and month in all data
SELECT start_station_id AS station_id,
started_at_year AS year,
started_at_month AS month,
started_at_day AS day,
COUNT(DISTINCT(trip_id)) AS trips
FROM read_parquet({fpaths_proc_all})
WHERE started_at_year <= 2022
GROUP BY all
),
-- 2. LEFT JOIN with table indicating if station is a top-performer
t2 AS (
SELECT *
FROM t1
LEFT JOIN (
SELECT station_id,
is_top_perform_station
FROM df_stations
) USING (station_id)
),
-- 3. fill any missing values in the is_top_perform_station column after LEFT JOIN
-- with FALSE
t3 AS (
SELECT --fill missing values in is_top_perform_station with FALSE
* EXCLUDE(is_top_perform_station),
COALESCE(is_top_perform_station, FALSE) AS is_top_perform_station
FROM t2
),
-- 4. calculate average daily trips per station, station type, year and month
t4 AS (
SELECT -- change the top-performing station column from a boolean to a string
(
CASE WHEN is_top_perform_station = True
THEN 'top-performer'
ELSE 'others' END
) AS station_type,
year,
month,
day,
make_date(year, month, day) AS date,
COUNT(DISTINCT(station_id)) AS num_stations_used,
SUM(trips) AS trips,
SUM(trips)/num_stations_used AS avg_trips_per_station
FROM t3
GROUP BY ALL
ORDER BY ALL
),
-- 5. LEFT JOIN with minimum and maximum temperature in daily weather data
t5 AS (
SELECT *
FROM t4
LEFT JOIN (
SELECT time as date,
MIN(tmin) AS tmin,
MAX(tmax) AS tmax
FROM read_parquet({[fpath_weather]})
GROUP BY ALL
) USING (date)
)
SELECT *,
make_date(2021, 10, 1) AS corporate_start
FROM t5
ORDER BY station_type, date
"""
df_daily_overall = run_sql_query(query).convert_dtypes()
df_daily_overall = (
df_daily_overall
.assign(
tmin=lambda df: df['tmin'].ffill(),
tmax=lambda df: df['tmax'].ffill(),
)
)
pu.show_df(df_daily_overall)
| column | station_type | year | month | day | date | num_stations_used | trips | avg_trips_per_station | tmin | tmax | corporate_start |
|---|---|---|---|---|---|---|---|---|---|---|---|
| dtype | string[python] | Int32 | Int32 | Int32 | datetime64[us] | Int64 | Int64 | Float64 | Float64 | Float64 | datetime64[us] |
| nunique | 2 | 5 | 12 | 31 | 1826 | 415 | 2951 | 3611 | 401 | 439 | 1 |
| missing | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | others | 2018 | 1 | 1 | 2018-01-01 | 83 | 128 | 1.542169 | -21.3 | -8.7 | 2021-10-01 |
| 1 | others | 2018 | 1 | 2 | 2018-01-02 | 143 | 515 | 3.601399 | -13.1 | -7.8 | 2021-10-01 |
| 2 | others | 2018 | 1 | 3 | 2018-01-03 | 139 | 622 | 4.47482 | -13.5 | -6.3 | 2021-10-01 |
| 3 | others | 2018 | 1 | 4 | 2018-01-04 | 141 | 619 | 4.390071 | -20.5 | -8.9 | 2021-10-01 |
| 4 | others | 2018 | 1 | 5 | 2018-01-05 | 123 | 393 | 3.195122 | -23.0 | -15.0 | 2021-10-01 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3647 | top-performer | 2022 | 12 | 27 | 2022-12-27 | 92 | 1244 | 13.521739 | -7.0 | -2.0 | 2021-10-01 |
| 3648 | top-performer | 2022 | 12 | 28 | 2022-12-28 | 94 | 1827 | 19.43617 | -2.0 | 5.0 | 2021-10-01 |
| 3649 | top-performer | 2022 | 12 | 29 | 2022-12-29 | 96 | 2352 | 24.5 | 2.0 | 9.2 | 2021-10-01 |
| 3650 | top-performer | 2022 | 12 | 30 | 2022-12-30 | 96 | 2120 | 22.083333 | 8.0 | 13.0 | 2021-10-01 |
| 3651 | top-performer | 2022 | 12 | 31 | 2022-12-31 | 95 | 1130 | 11.894737 | 4.0 | 10.0 | 2021-10-01 |
3652 rows × 11 columns
CPU times: user 10.7 s, sys: 532 ms, total: 11.2 s Wall time: 1.15 s
Show above as a chart
%%time
chart = vzu.plot_multi_axis_line_chart_grid(
df_daily_overall,
xvar='yearmonthdate(date):T',
line_colors={'top-performer': 'darkgreen', 'others': '#31a354'},
y1var='avg_trips_per_station:Q',
y2var_min='tmin:Q',
y2var_max='tmax:Q',
y_title_text='Daily Ridership per Station',
y2_title_text='Temperature Range (C)',
title_text='Total Daily Ridership is Maximized During Warmest Months',
sub_title_text_color="#7f7f7f",
sub_title_vertical_padding=5,
shading_opacity=0.85,
shading_color='lightgrey',
axis_label_fontsize=14,
fig_size=dict(width=700, height=360),
)
chart.save(
os.path.join(
figures_dir, '14_daily_max_temperature_and_ridership.png'
)
)
chart
CPU times: user 453 ms, sys: 4.4 ms, total: 457 ms Wall time: 333 ms
Use the Pearson correlation coefficient (r, taken from scipy's linregress result) to calculate the yearly correlation between daily overall ridership and maximum daily temperature
df_daily_overall_r2 = (
df_daily_overall
.groupby(['station_type', 'year'], as_index=False)
.apply(lambda df: linregress(df['trips'], df['tmax']).rvalue, include_groups=False)
.rename(columns={None: 'y'})
)
df_daily_overall_r2
| station_type | year | y | |
|---|---|---|---|
| 0 | others | 2018 | 0.848727 |
| 1 | others | 2019 | 0.877885 |
| 2 | others | 2020 | 0.786010 |
| 3 | others | 2021 | 0.846501 |
| 4 | others | 2022 | 0.898041 |
| 5 | top-performer | 2018 | 0.874282 |
| 6 | top-performer | 2019 | 0.889999 |
| 7 | top-performer | 2020 | 0.793471 |
| 8 | top-performer | 2021 | 0.845074 |
| 9 | top-performer | 2022 | 0.896529 |
Show above as a heatmap
%%time
chart = vzu.plot_simple_heatmap(
    df_daily_overall_r2,
    xvar='year:O',
    yvar='station_type:O',
    color_by_col='y:Q',
    color_scale='linear',
    annot_color_threshold=0.8,
    grid_linewidth=0.75,
    ptitle_text='Daily Trips are Correlated to Max. Daily Temperature',
    axis_label_fontsize=14,
    fig_size=dict(width=510, height=115),
)
chart.save(
os.path.join(
figures_dir,
'15_daily_max_temperature_ridership_correlation.png'
)
)
chart
CPU times: user 52.9 ms, sys: 4.03 ms, total: 56.9 ms Wall time: 54.3 ms
Notes
- Darker shades of red indicate stronger correlation. Lighter shades of red indicate weaker correlation. Yellow indicates the weakest correlation.
Observations
- The line chart of average daily bike share ridership per station and station type, overlaid with the daily temperature range, shows the same mid-year peak in daily ridership that was previously seen in monthly ridership.
- The heatmap shows a strong correlation between daily ridership and maximum daily temperature. Given the seasonal peak above and the other ridership patterns seen earlier, this is not surprising, and it holds at both types of stations.
Summary of Metrics Used
- average number of daily departures (trips) per station and station type for each year and month
Recommendations¶
Based on insights into the temporal bike share patterns, it is recommended to display ads on the faces of bike share stations during the hours of the day, days of the week and months of the year during which average historical ridership per station was at its peak. Since the peaks are not sharp spikes (or delta functions) but have non-zero width, a window before and after the observed peak is required.
In terms of station attributes, the Regular station configuration and stations that accept credit card payments capture the majority (at least 80%) of top-performing stations, so it is not necessary to filter stations based on these two attributes.
In terms of overall station performance, stations that are top-performers overall and on both weekdays and weekends account for approximately 80% of top-performing stations. These stations are preferred over those that were top-performers overall but only on weekdays or only on weekends, since they allow consistency in when to display ads rather than guessing which part of the week will continue to drive top performance. So, a filter that only picks up the former is required.
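The "peak plus a window" idea can be sketched as follows, on fabricated hourly averages; the 80%-of-peak threshold is an assumption for illustration, not a client requirement:

```python
import pandas as pd

# fabricated average ridership per station by hour of day (0-23)
hourly = pd.Series(
    [0.2, 0.1, 0.1, 0.1, 0.2, 0.5, 1.5, 3.0, 4.0, 2.5, 1.8, 1.9,
     2.0, 2.1, 2.4, 3.2, 4.2, 4.4, 3.1, 2.0, 1.4, 1.0, 0.6, 0.3],
    index=range(24), name="avg_trips_per_station",
)
# keep every hour within 80% of the peak value, i.e. the peak plus its shoulders
threshold = 0.8 * hourly.max()
ad_hours = hourly[hourly >= threshold].index.tolist()
print(ad_hours)  # → [8, 16, 17]
```

With this toy profile, the morning peak (hour 8) and the evening peak with one shoulder hour (hours 16 and 17) are selected.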
Below are the recommendations based on temporal insights
recommended_hour_filters = {
    "weekday_prime": (
        "day_of_week IN ('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday') "
        "AND month IN ('May', 'June', 'July', 'August', 'September', 'October') "
        "AND hour IN (7, 8, 9, 15, 16, 17, 18) "
        "AND user_type IN ('Annual', 'Casual')"
    ),
    "weekend_prime": (
        "day_of_week IN ('Saturday', 'Sunday') "
        "AND month IN ('May', 'June', 'July', 'August', 'September', 'October') "
        "AND hour IN (13, 14, 15, 16, 17, 18) "
        "AND user_type IN ('Casual')"
    ),
    "weekday_offseason": (
        "day_of_week IN ('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday') "
        "AND month IN ('November', 'December') "
        "AND hour IN (15, 16, 17) "
        "AND user_type IN ('Annual', 'Casual')"
    ),
}
df_temporal_recommends = (
pd.DataFrame.from_dict(recommended_hour_filters, orient='index')
.transpose()
.convert_dtypes()
)
with pd.option_context('display.max_colwidth', None):
pu.show_df(df_temporal_recommends)
| column | weekday_prime | weekend_prime | weekday_offseason |
|---|---|---|---|
| dtype | string[python] | string[python] | string[python] |
| nunique | 1 | 1 | 1 |
| missing | 0 | 0 | 0 |
| 0 | day_of_week IN ('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday') AND month IN ('May', 'June', 'July', 'August', 'September', 'October') AND hour IN (7, 8, 9, 15, 16, 17, 18) AND user_type IN ('Annual', 'Casual') | day_of_week IN ('Saturday', 'Sunday') AND month IN ('May', 'June', 'July', 'August', 'September', 'October') AND hour IN (13, 14, 15, 16, 17, 18) AND user_type IN ('Casual') | day_of_week IN ('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday') AND month IN ('November', 'December') AND hour IN (15, 16, 17) AND user_type IN ('Annual', 'Casual') |
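These filter strings target SQL. If the trips table were a pandas DataFrame instead, the same weekday_prime conditions could be expressed as a boolean mask (a sketch on fabricated rows; the column names follow the filter strings):

```python
import pandas as pd

# fabricated trip rows with the columns referenced by the filters
trips = pd.DataFrame({
    "day_of_week": ["Monday", "Saturday", "Wednesday", "Tuesday"],
    "month": ["June", "July", "December", "August"],
    "hour": [8, 14, 16, 17],
    "user_type": ["Annual", "Casual", "Casual", "Casual"],
})
weekdays = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"]
prime_months = ["May", "June", "July", "August", "September", "October"]
# pandas equivalent of the weekday_prime SQL filter
mask = (
    trips["day_of_week"].isin(weekdays)
    & trips["month"].isin(prime_months)
    & trips["hour"].isin([7, 8, 9, 15, 16, 17, 18])
    & trips["user_type"].isin(["Annual", "Casual"])
)
print(trips[mask])  # rows 0 and 3 satisfy all four conditions
```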
Below are the recommendations based on combined temporal and overall station performance insights
recommended_performance_filters = {
"overall_weekday_weekend": (
"is_top_perform_station_weekday = TRUE "
"AND is_top_perform_station_weekend = TRUE "
"AND is_top_perform_station = TRUE"
)
}
df_performance_recommends = (
pd.DataFrame.from_dict(recommended_performance_filters, orient='index')
.transpose()
.convert_dtypes()
)
with pd.option_context('display.max_colwidth', None):
pu.show_df(df_performance_recommends)
| column | overall_weekday_weekend |
|---|---|
| dtype | string[python] |
| nunique | 1 |
| missing | 0 |
| 0 | is_top_perform_station_weekday = TRUE AND is_top_perform_station_weekend = TRUE AND is_top_perform_station = TRUE |
Export temporal recommendations to disk
%%time
fname_prefix = "recommendations_temporal"
_ = df_temporal_recommends.pipe(
flut.load,
processed_data_dir,
fname_prefix,
my_timezone,
True,
)
Exported 1 rows of recommendations_temporal data to /home/jovyan/data/processed/recommendations_temporal__20240321_114636.parquet.gzip CPU times: user 13.2 ms, sys: 68 µs, total: 13.3 ms Wall time: 13 ms
Export combined temporal+performance recommendations to disk
%%time
fname_prefix = "recommendations_performance"
_ = df_performance_recommends.pipe(
flut.load,
processed_data_dir,
fname_prefix,
my_timezone,
True,
)
Exported 1 rows of recommendations_performance data to /home/jovyan/data/processed/recommendations_performance__20240321_114636.parquet.gzip CPU times: user 1.33 ms, sys: 68 µs, total: 1.39 ms Wall time: 1.16 ms
Discussion¶
Conclusions¶
- Since the corporate plan was introduced in late September of 2021, a (behavioural) change in the trend of bike share ridership has occurred. The factors that have contributed to this change in behaviour, and that are likely to remain present moving forward, are
- a switch to hybrid working hours (which affects both types of users)
- increased Casual ridership
- increased environmental awareness
- improved cost flexibility (the corporate plan might partly account for this)
- increased bicycle infrastructure (e.g. bike lanes) introduced by the City of Toronto
- On a monthly basis, bike share ridership starts increasing in April or May each year, reaches a peak during the middle of the year (July) and then decreases, with a strong dropoff after the Canadian Thanksgiving weekend (the second Monday of October). The same is true for weekdays and weekends. On weekdays, average monthly ridership per station by Annual members has stayed relatively unchanged since 2018 at top-performing stations but shows a decreasing trend at other stations. Since the corporate plan was offered, Casual ridership accounts for the majority of average monthly bike share ridership per station on both weekdays and weekends.
- On an hourly basis, weekday Annual ridership is characterized by peaks during the morning and evening commuting hours (8AM and 4PM). The morning peak was much weaker for Casual users from 2018 to 2021; in 2022, it sharply increased, and peak morning and evening ridership by Casual users now exceeds that of Annual users. Weekend ridership is characterized by a single weaker and broader peak centered at 4PM, where Casual ridership more than doubled Annual hourly ridership in 2022 after the two were generally similar for the previous two years. These patterns are consistent between top-performing and other stations on weekends. The charts suggest commuter-driven ridership on weekdays and more leisurely usage on weekends. In summary, weekday hourly ridership patterns were previously driven by Annual users; in 2022, however, the rise of Casual ridership means that Annual and Casual members show a similar hourly profile in average ridership per station, reflecting hybrid working hours. The rise of Casual ridership seen on a monthly basis is also seen on an hourly basis on weekends, where Casual users are now the dominant contributor to weekend ridership.
- On an intra-week basis, average ridership per station by day of the week by Annual members shows a weak (at top-performing stations) or more pronounced (at other stations) downward trend since 2020. By comparison, ridership by Casual members has shown only year-over-year growth since 2018. Daily Annual ridership has been characterized by stronger demand during the middle of the week (Wednesday or Thursday) than on weekends since 2018. This is contrasted by Casual ridership, which increases through the week from Monday, reaches a maximum on Saturday, and has shown stronger demand on weekends (Saturday and Sunday) than on weekdays (Monday to Friday) since 2018. In 2022, the lowest average daily Casual ridership per station exceeded the highest average daily Annual ridership per station across all days of the week. These patterns are also consistent between top-performing and other stations. This reinforces the observation, seen previously by month and by hour, that Casual ridership is increasing across the network.
- There is evidence of increased resilience of Casual ridership to colder daily temperatures. Casual ridership grows at a sharper rate above a temperature cutoff at both types of stations, and this cutoff has been qualitatively shifting towards colder maximum daily temperatures. Also, as of 2022, the maximum daily temperature dependence profiles for Annual and Casual members' daily ridership, which were divergent during 2019, now overlap with each other. Finally, it is not surprising that daily ridership is observed to be correlated to maximum daily temperature.
Summary of Assumptions¶
- Same as in data retrieval and processing steps.
Version Information¶
packages = [
'scipy',
'pandas',
'pyarrow',
'duckdb',
'altair',
'vl-convert',
]
print(
watermark(
updated=True,
current_date=True,
current_time=True,
timezone=True,
custom_time="%Y-%m-%d %H:%M:%S %Z",
python=True,
machine=True,
packages=','.join(packages),
)
)
Last updated: 2024-03-21 15:46:36 UTC Python implementation: CPython Python version : 3.11.8 IPython version : 8.22.2 scipy : 1.12.0 pandas : 2.2.1 pyarrow : 15.0.1 duckdb : 0.10.0 altair : 5.2.0 vl-convert: not installed Compiler : GCC 12.3.0 OS : Linux Release : 6.6.10-76060610-generic Machine : x86_64 Processor : x86_64 CPU cores : 12 Architecture: 64bit